Replication study: Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs

Replication studies are essential for validation of new methods, and are crucial to maintain the high standards of scientiﬁc publications, and to use the results in practice. We have attempted to replicate the main method in Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs published in JAMA 2016; 316(22)[1]. We re-implemented the method since the source code is not available, and we used publicly available data sets. The original study used non-public fundus images from EyePACS and three hospitals in India for training. We used a diﬀerent EyePACS data set from Kaggle. The original study used the benchmark data set Messidor-2 to evaluate the algorithm’s performance. We used the same data set. In the original study, ophthalmologists re-graded all images for diabetic retinopathy, macular edema, and image gradability. There was one diabetic retinopathy grade per image for our data sets, and we assessed image gradability ourselves. The original study did not provide hyper-parameter settings. But some of these were later published. We were not able to replicate the original study, due to insuﬃcient level of detail in the method description. Our best eﬀort of replication resulted in an algorithm that was not able to reproduce the results of the original study. Our algorithm’s area under the receiver operating characteristic curve (AUC) of 0.94 on the Kaggle EyePACS test set and 0.80 on Messidor-2 did not come close to the reported AUC of 0.99 on both test sets in the original study. This may be caused by the use of a single grade per image, diﬀerent data, or diﬀerent not described hyper-parameters. We conducted experiments with various normalization methods


Introduction
Being able to replicate a scientific paper by strictly following the described methods is a cornerstone of science.Replicability is essential for the development of medical technologies based on published results.However, there is an emerging concern that many studies are not replicable, raised for bio-medical research [2], computational sciences [3,4], and recently for machine learning [5].
The terms replicate and reproduce are often used without precise definition, and sometimes referring to the same concept.We will distinguish the two in the scientific context by defining replication as repeating the method as described and reproducing as obtaining the result as reported.The scientific standard for publication is to describe the method with sufficient detail for the study to be repeated, and a replication study is then the attempt to repeat the study as described, not as it might have been conducted.To balance the level of detail and readability of a manuscript, a replication attempt must follow the standard procedures of the field when details about the method are missing.
If the data that produced the reported results are available, the replicated method should reproduce the original results, and any deviation points towards a lack of detail in the description of the method, assuming that the replication is conducted correctly.If the data are not available, deviations in the results can be due to either insufficient description of the method, or differences in the data, and it will not be possible to separate the two sources of deviation.This should not prevent replication studies from being conducted, even if the conclusions regarding replicability are less definite than if the data were available.
Deep learning has become a hot topic within machine learning due to its promising performance of finding patterns in large data sets.There are dozens of libraries that make deep learning methods easily available for any developer.This has consequently led to an increase of published articles that demonstrate the feasibility of applying deep learning in practice, particularly for image classification [6,7].However, there is an emerging need to show that studies are replicable, and hence be used to develop new medical analysis solutions.Ideally, the data set and the source code are published, so that other researchers can verify the results by using the same or other data.However, this is not always practical, for example for sensitive data, or for methods with commercial value [4,5].
In this study, we make an assessment on the replicability of a deep learning method.We have chosen to attempt to replicate the main method from Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs, published in JAMA 2016; 316 (22) [1].As of April 2018, this article had been cited 350 times [8].We chose to replicate this study because it is a well-known and high-impact study within the medical field, the source code has not been published, and there are as far as we know not any others who have attempted to replicate this study.
Deep learning methods have been used in other publications for detection of diabetic retinopathy [9,10,11] with high performance.Direct comparison between these algorithms is not possible, since they use different data sets for evaluation of performance.However, these papers confirm the main findings of [1]; that deep learning can be used to automatically detect diabetic retinopathy.We assess [1] to be most promising regarding replicability, due to its enhanced focus on method and more detailed descriptions.
The original study describes an algorithm (hereby referred to as the original algorithm) for detection of referable diabetic retinopathy (rDR) in retinal fundus photographs.The algorithm is trained and validated using 118 419 fundus images retrieved from EyePACS and from three eye hospitals in India.The original algorithm's performance was evaluated on 2 test sets, and achieved an area under the receiver operating characteristic curve (AUC) for detecting rDR of 0.99 for both the EyePACS-1 and the Messidor-2 test sets.Two operating points were selected for high sensitivity and specificity.The operating point for high specificity had 90.3% and 87.0% sensitivity and 98.1% and 98.5% specificity for the EyePACS-1 and Messidor-2 test sets, whereas the operating point for high sensitivity had 97.5% and 96.1% sensitivity and 93.4% and 93.9% specificity, respectively.
To assess replicability of the method used to develop the original algorithm for detection of rDR, we used similar images from a publicly available EyePACS data set for training and validation, and we used a subset from the EyePACS data set and images from the public Messidor-2 data set for performance evaluation.We had to find validation hyper-parameters and the image normalization method ourselves, because they were not described in the original study.Our objective is to compare the performance of the original rDR detection algorithm to our result algorithm after trying to replicate, taking into account potential deviations in the data sets, having fewer grades, and potential differences in normalization methods and other hyper-parameter settings.
We were not able to replicate the original study, due to insufficient level of detail in the method description.Our best effort of replication resulted in an algorithm that was not able to reproduce the results of the original study.Our algorithm's AUC for detecting rDR for our EyePACS and Messidor-2 test sets were 0.94 and 0.80, respectively.The operating point for high specificity had 83.4% and 67.9% sensitivity and 90.1% and 76.4% specificity for our EyePACS and Messidor-2 test sets, and the operating point for high sensitivity had 89.9% and 73.7% sensitivity and 83.8% and 69.7% specificity.The results can differ for four reasons.First, we used public retinal images with only one grade per image, whereas in the original study the non-public retinal images were re-graded multiple times.Second, the now published list of hyper-parameters used in the original study [12] lack details regarding the normalization method and the validation procedure used, so the original algorithm may have been tuned better.Third, there might be errors in the original study or methodology.The last possible reason is that we may have done something wrong with replicating the method by having misinterpreted the methodology.We do not know for sure which of the four reasons has led to our considerably worse performance.
We do not believe our results invalidate the main findings of the original study.However, our result gives a general insight into the challenges of replicating studies that do not use publicly available data and publish source code, and it motivates the need for additional replication studies in deep learning.We have published our source code with instructions for how to use it with public data.This gives others the opportunity to improve upon the attempted replication.

Data sets
The data sets consist of images of the retinal fundus acquired for diabetic retinopathy screening.Any other information regarding the patient is not part of the data sets.Each image is graded according to severity of symptoms (see Section 2.2).
The original study obtained 128 175 retinal fundus images from EyePACS in the US and from three eye hospitals in India.118 419 macula-centered images from this data set were used for algorithm training and validation (referred to as development set, divided into training and tuning set in the original study).To evaluate the performance of the algorithm, the original study used two data sets (referred to as validation sets in the original study).For evaluating an algorithm's performance, the term test set is commonly used.The first test set was a randomly sampled set of 9963 images retrieved at EyePACS screening sites between May 2015 and October 2015.The second test set was the publicly available Messidor-2 data set [13,14], consisting of 1748 images.We provide an overview of the differences in image distribution used in our replication study compared with the original study in Figure 2.
We obtained images for training, validation and testing from two sources: EyePACS from a Kaggle competition [15], and the Messidor-2 set that was used in the original study.The Messidor-2 set is a benchmark for algorithms that detect diabetic retinopathy.We randomly sampled the Kaggle EyePACS data set consisting of 88 702 images into a training and validation set of 57 146 images and a test set of 8790 images.The leftover images were mostly images graded as having no diabetic retinopathy and were not used for training the algorithm.The reason for the number of images in our training and validation set is to keep the same balance for the binary rDR class as in the original study's training and validation set.Our EyePACS test set has an identical amount of images and balance for the binary rDR class as in the original study's EyePACS test set.We used all 1748 images from the Messidor-2 test set.

Grading
The images used for the algorithm training and testing in the original study were all graded by ophthalmologists for image quality (gradability), the presence of diabetic retinopathy, and macular edema.We did not have grades for macular edema for all our images, so we did not train our algorithm to detect macular edema.Kaggle [16] describes that some of the images in their EyePACS distribution may consist of noise, contain artifacts, be out of focus, or be over-or underexposed.[17] states further that 75% of the EyePACS images via Kaggle are estimated gradable.For this study one of the authors (MV) graded all Kaggle and Messidor-Original images on their image quality with a simple grading tool (Figure 1).MV is not a licensed ophthalmologist, but we assume fundus image quality can be reliably graded by non-experts.We used the "Grading Instructions" in the Supplement of the original study to assess image quality.We publish the image quality grades with the source code.Images of at least adequate quality were considered gradable.
In the original study, diabetic retinopathy was graded according to the International Clinical Diabetic Retinopathy scale [18], with no, mild, moderate, severe or proliferative severity.
The Kaggle EyePACS set had been graded by one clinician for the presence of diabetic retinopathy using the same international scaling standard as used in the original study.We have thus only one diagnosis grade for each image.Kaggle does not give more information about where the data is from.The Messidor-2 test set and its diabetic retinopathy grades were made available by Ambramoff [19].

Algorithm training
The objective of this study is to assess replicability of the original study.We try to replicate the method by following the original study's methodology as accurately as possible.As in the original study, our algorithm is created through deep learning, which involves a procedure of training a neural network to perform the task of classifying images.We trained the algorithm with the same neural network architecture as in the original study: the InceptionV3 model proposed by Szegedy et al [20].This neural network consists of a range of convolutional layers that transforms pixel intensities to local features before converting them into global features.
The fundus images from both training and test sets were preprocessed as described by the original study's protocol for preprocessing.In all images the center and radius of the each fundus were located and resized such that each image gets a height and width of 299 pixels, with the fundus center in the middle of the image.A later article reports a list of data augmentation and training hyper-parameters for the trained algorithm in the original study [12].We applied the same data augmentation settings in our image preprocessing procedure.
The original study used distributed stochastic gradient descent proposed by Dean et al [21] as the optimization function for training the parameters (i.e.weights) of the neural network.This implies that their neural network was trained in parallel, although the paper does not describe it.We did not conduct any distributed training for our replica neural network.According to the hyper-parameters published in [12], the optimization method that was used in the original study was RMSProp.Therefore, we used RMSProp as our optimization procedure.The hyper-parameter list specifies a learning rate of 0.001, so we used this same learning rate for our algorithm training.We furthermore applied the same weight decay of 4 * 10 −5 .
As in the original study, we used batch normalization layers [22] after each convolutional layer.Our weights were also pre-initialized using weights from the neural network trained to predict objects in the ImageNet data set [23].
The neural network in the original study was trained to output multiple binary predictions: 1) whether the image was graded moderate or worse diabetic retinopathy (i.e.moderate, severe, or proliferative grades); 2) severe or worse diabetic retinopathy; 3) referable diabetic macular edema; or 4) fully gradable.The term referable diabetic retinopathy was defined in the original study as an image associated with either or both category 1) and 3).For the training data obtained in this replication study, only grades for diabetic retinopathy were present.That means that our neural network outputs only one binary prediction: moderate or worse diabetic retinopathy (referable diabetic retinopathy).
In this study, the training and validation sets were split like in the original study: 80% was used for training and 20% was used for validating the neural network.It is estimated that 25% of the Kaggle EyePACS set consists of ungradable images [17].Therefore, we also assessed image gradability for all Kaggle EyePACS images, and we trained an algorithm with only gradable images.In the original study, the performance of an algorithm trained with only gradable images was also summarized.We do not use the image quality grades as an input for algorithm training.
Specific details on the image normalization method or hyper-parameters of the validation procedure were not specified, so we conducted experiments to find the normalization method and hyper-parameter settings that worked well for training and validating the algorithms.We trained with three normalization methods: 1) image standardization, which involves subtracting the mean from each image and dividing each image by the standard deviation; 2) normalizing images to a [0, 1] range; and 4) normalizing images to a [-1, 1] range.

Algorithm validation
We validate the algorithm by measuring the performance of the resulting neural network by the area under the receiver operating characteristic curve (AUC) on a validation set, as in the original study.We find the area by thresholding the network's output predictions, which are continuous numbers ranging from 0 to 1.By moving the operating threshold on the predictions, we obtain different results for sensitivity and specificity.We then plot sensitivity against 1-specificity for 200 thresholds.Finally, the AUC of the validation set is calculated, and becomes an indicator for how well the neural network detects referable diabetic retinopathy.The original study did not describe how many thresholds were used for plotting AUC, so we used the de facto standard of 200 thresholds.
The original paper describes that the AUC value of the validation set was used for the earlystopping criterion [24]; training is terminated when a peak AUC on the validation set is reached.This prevents overfitting the neural network on the training set.In our validation procedure, we also use the AUC calculated from the validation set as an early stopping criterion.To determine if a peak AUC is reached, we compared the AUC values between different validation checkpoints.
To avoid stopping at a local maximum of the validation AUC function, our network may continue to perform training up to n epochs (i.e.patience of n epochs).Since the original paper did not describe details regarding the validation procedure, we had to experiment with several settings for patience.One epoch of training is equal to running all images through the network once.
We used ensemble learning [25] by training 10 networks on the same data set, and using the final prediction computed by taking the mean of the predictions of the ensemble.This was also done in the original study.
In the original study, additional experiments were conducted to evaluate the performance of the resulting algorithm based on the training set, compared with performance based on subsets of images and grades from the training set.We did not replicate these experiments for two reasons.First, we chose to focus on replicating the main results of the original paper.That is, the results of an algorithm detecting referable diabetic retinopathy.Second, we cannot perform subsampling of grades, as we only have one grade per image.

Results
As for the early-stopping criterion at a peak AUC, we found that a patience of 10 epochs worked well.Our chosen requirement for a new peak AUC was a value of AUC that is larger than the previous peak value, with a minimum difference of 0.01.The normalization method of normalizing the images to a [-1, 1] range outperformed the other normalization methods.
The replica algorithm's performance was evaluated on two independent test sets.We provide an overview of the differences in image distribution used in our replication study compared with the original study in Figure 2 in Section 2.2.Our replica algorithm trained with normalizing images to a [-1, 1] range yielded an AUC of 0.94 and 0.80 on our Kaggle EyePACS test data set and Messidor-2, respectively (Figure 4 and Table 1).We observe that there is a large discrepancy between the AUC of our replication study and the original study.Our results for training with the other conducted normalization methods are also shown in Table 1. Figure 5 and 6 show their corresponding receiver operating characteristic curves.Lastly, we attempted training by excluding non-gradable images, but this has shown to not increase algorithm performance.

Test set
High sensitivity High specificity AUC score

Discussion
The results show substantial performance differences between the original study's algorithm and our replica algorithm.Even though we followed the methodology of the original study as closely as possible, our algorithm did not come close to the results in the original study.This is probably because our algorithms were trained with different public data, under different hyper-parameters, and because in the original study ophthalmologic experts re-graded all their images.According to the original study, the validation and test sets should have multiple grades per image, because it will provide a more reliable measure of a model's final predictive ability.Their results on experimenting with only one grade per image show that their algorithm's performance declines with 36%.
The hyper-parameters were not published when we started this replication study.Later, hyperparameters for training and data augmentation were published in [12], and then we retrained all algorithms with these hyper-parameters and data augmentation settings.However, some of the details for the methods in the original study remain unspecified.First, the hyper-parameter settings for the validation procedure and the used normalization method are missing.Second, it is unclear how the algorithm's predictions for diabetic retinopathy or macular edema are interpreted in case of ungradable images.The image quality grades might have been used as an input for the network, or the network might be concatenated with another network that takes the image quality as an input.Third, apart from the main algorithm that detects referable diabetic retinopathy and outputs 4 binary classifications, other algorithms seem to have been trained as well.An example is the described algorithm that only detects referable diabetic retinopathy for gradable images, and an algorithm that detects all-cause referable diabetic retinopathy, which presents moderate or worse diabetic retinopathy, referable macular edema, and ungradable images.Details on how these other algorithms are built are however not reported.It is unclear whether the main network has been used or if the original study trained new networks.Lastly, the original paper did not state how many iterations it took for their proposed model to converge during training, or describe how to find a converging model.
Our results show that picking the appropriate normalization method is essential.From the three different methods we trained with, image normalization to a [-1, 1] range turned out to be the best performing method and is therefore assumed to be used in the original study.This is also likely due to the fact that the pre-trained InceptionV3 network was trained with [-1, 1] normalized ImageNet images.

Hyper-parameters
The main challenge in this replication study was to find hyper-parameters, which were not specified in the original paper, such that the algorithm does not converge on a local maximum of the validation AUC function.To understand how we should adjust the hyper-parameters, we measured the Brier score on the training set and the AUC value on the validation set after each epoch of training.One possible reason for the algorithm having problems to converge may be the dimensions of the fundus images.As the original study suggests, the original fundus images were preprocessed and scaled down to a width and height of 299 pixels to be able to initialize the InceptionV3 network with ImageNet pre-trained weights, which have been trained with images of 299 by 299 pixels.We believe it is difficult for ophthalmologists to find lesions in fundus images of this size, so we assume the algorithm has difficulties with detecting lesions as well.[17] also points out this fact, and suggests re-training an entire network with larger fundus images and randomly initialized weights instead.And as mentioned before, it seems like the original study extended the InceptionV3 model architecture for their algorithm to use image gradability as an input parameter.

Kaggle images
A potential drawback with the images from Kaggle is that it contains grades for diabetic retinopathy for all images.We found that 19.9% of these images is ungradable, and it is thus possible that the algorithm will "learn" features for ungradable images, and make predictions based on anomalies.This is likely to negatively contribute to the algorithm's predictive performance, but we were not able to show a significant difference of performance between an algorithm trained on all images and an algorithm trained on only gradable images.

Conclusion
We re-implemented the main method from JAMA 2016; 316( 22), but we were not able to get the same performance as reported in that study using publicly available data.The main identified sources for deviation between the original and the replicated results are hyper-parameters, and quality of data and grading.From trying several normalization methods, we found that the original study most likely normalized images to a [-1, 1] range, because it yielded the best performing algorithm, but its results still deviate from the original algorithm's results.We assume the impact of the hyper-parameters to be minor, but there is no standard setting for hyper-parameters, and we therefore regard this as missing level of detail.The original study had access to data of higher quality than those that are publicly available, and this is likely to account for part of the deviation in results.Gulshan et al showed that the performance levels off around 40 000 (Figure 4A), and we therefore assume that the reduced size of replication data set is not a large source for deviation in the results.The number of grades per image is a possible explanation of Gulshan et al's superior results, but the impact is uncertain.Figure 4B depicts performance as a function of grades, but there is an overfitting component: 100% vs. 65% specificity for the training and test set, respectively, and it is not possible to distinguish the contribution from the overfitting from that of the low number of grades.
The source code of this replication study and instructions for running the replication are available at https://github.com/mikevoets/jama16-retina-replication.

Figure 1 :
Figure 1: Screenshot of grading tool used to assess gradability for all images.

Figure 2 :
Figure 2: Data set distribution in original study vs. this replication study.

Figure 3 :
Figure 3: Examples of ungradable images because they are either out of focus, under-, or overexposed.

Figure 4 :
Figure 4: Area under receiver operating characteristic curve for training with normalizing images to a [-1, 1] range.

Figure 5 :
Figure 5: Area under receiver operating characteristic curve for training with standardizing images.

Figure 6 :
Figure 6: Area under receiver operating characteristic curve for training with normalizing images to a [0, 1] range.

Table 1 :
Performance on test sets of replication with various normalization methods, compared to results from the original study.The results of the original study are depicted in parenthesizes.