A Neural Network for Automated Image Quality Assessment of Optic Disc Photographs

This study describes the development of a convolutional neural network (CNN) for automated assessment of optic disc photograph quality. Using a code-free deep learning platform, a total of 2377 optic disc photographs were used to develop a deep CNN capable of determining optic disc photograph quality. Of these, 1002 were good-quality images, 609 were acceptable-quality, and 766 were poor-quality images. The dataset was split 80/10/10 into training, validation, and test sets and balanced for quality. A ternary classification model (good, acceptable, and poor quality) and a binary model (usable, unusable) were developed. In the ternary classification system, the model had an overall accuracy of 91% and an AUC of 0.98. The model had higher predictive accuracy for images of good (93%) and poor quality (96%) than for images of acceptable quality (91%). The binary model performed with an overall accuracy of 98% and an AUC of 0.99. When validated on 292 images not included in the original training/validation/test dataset, the model’s accuracy was 85% on the three-class classification task and 97% on the binary classification task. The proposed system for automated image-quality assessment for optic disc photographs achieves high accuracy in both ternary and binary classification systems, and highlights the success achievable with a code-free platform. There is wide clinical and research potential for such a model, with potential applications ranging from integration into fundus camera software to provide immediate feedback to ophthalmic photographers, to prescreening large databases before their use in research.


Introduction
Evaluation of the optic disc is an integral part of the diagnosis and management of glaucoma, as structural damage to the optic nerve can often be detected even before visual field defects are present [1,2]. Careful recognition of preperimetric morphological changes to the optic nerve is vital in monitoring disease progression [3][4][5]. Structural assessment of the optic disc can be accomplished with optical coherence tomography (OCT) and fundus photography. Subjective evaluation of optic disc photographs by clinicians is hampered by poor interobserver agreement [6][7][8][9]. Fundus photography is appropriate for both population screening and long-term follow-up, as it can be low cost and allows for direct qualitative comparison to previously captured images, even images decades old. This stands in contrast to OCT, where high cost may impede wide adoption and rapid technological innovation makes comparison with historical images difficult. Both types of testing require training to capture images of sufficient quality for use in diagnostics.
In fundus photography, as in all medical imaging, poor image quality may confound accurate assessment of the image and render images diagnostically useless. Various studies have estimated the occurrence of poor-quality fundus photographs of the optic nerve head at 3.7-25% of all images captured, meaning that automated detection of these images has immediate potential for clinical utility [10][11][12][13]. Automated image-quality assessment (IQA) is also useful as a first "quality-control" step in the research and development of other computer vision and artificial intelligence (AI) systems.
Increasingly, AI is being explored for use in the diagnosis and management of glaucoma through automated assessment of the optic disc or fundus photographs. Models developed for this task use a variety of machine learning techniques. Statistical classifiers such as support vector machine (SVM), random forest algorithms, and k-nearest neighbor are common in earlier models; recent research has focused more heavily on the use of deep learning techniques such as convolutional neural networks [14,15]. Automated detection of glaucoma is sometimes based on explicit recognition of pathological differences in certain ocular features, such as retinal nerve fiber layer defects, presence of disc hemorrhages, extent of peripapillary atrophy, or cup-to-disc ratio [16][17][18][19][20]. Other studies have relied on more complex inputs (i.e., entire images), which allows for less supervised and more holistic classification [21][22][23]. The large-scale fundus photograph databases required for developing the latter category of models in particular would benefit from a reliable automated system for screening out unusable or poor-quality images. Furthermore, there is research demonstrating the detrimental impact of poor-quality images on the diagnostic ability of these algorithms [24][25][26].
The vast majority of prior research on automated ocular IQA has focused on retinal imaging nonspecific to glaucoma. There is minimal research looking specifically at IQA for glaucoma, which has a unique set of diagnostic criteria based on the appearance of the optic disc and therefore requires that certain anatomical structures be clearly visible. Furthermore, to our knowledge, all research in this area has used conventional machine learning pathways, which require significant expertise and computational power. In this study, we describe the development of a deep convolutional neural network (CNN) created by using a code-free deep learning platform for automated assessment of optic disc photograph quality.

Materials and Methods
The deep convolutional neural network was developed with a collection of 2377 disc photographs randomly selected from a database of fundus images captured between 1997 and 2020 at the Glaucoma Division of the UCLA Jules Stein Eye Institute. No exclusion criteria were applied during image selection, meaning that multiple images from the same patient or the same eye taken on different dates could be included. To minimize bias, when multiple images from the same patient were included, they were all assigned to only one of the split sets (training, validation, or test).
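The patient-level constraint on the split can be illustrated with a minimal sketch. It assumes each image is described by a hypothetical `(image_id, patient_id)` pair and splits by patient count rather than image count; it does not reproduce the class balancing or the exact procedure used in the study.

```python
import random
from collections import defaultdict


def split_by_patient(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign every image of a patient to exactly one split.

    `records` is a list of (image_id, patient_id) pairs. Both field names,
    the seed, and the rounding of split sizes are assumptions made for
    illustration only.
    """
    by_patient = defaultdict(list)
    for image_id, patient_id in records:
        by_patient[patient_id].append(image_id)

    # Shuffle patients (not images) so no patient straddles two splits.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n = len(patients)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    groups = {
        "train": patients[:n_train],
        "validation": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    return {name: [img for p in ids for img in by_patient[p]]
            for name, ids in groups.items()}
```

Because the unit of randomization is the patient, the resulting image proportions only approximate 80/10/10 when patients contribute different numbers of images.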
A total of 2377 optic disc photographs from 1684 eyes of 1360 patients at a glaucoma subspecialty clinic were used to train, validate, and test a convolutional neural network in an 80/10/10 split. Totals of 1626 digitized scanned slides and 751 digital images were used. The slide images were captured with a Zeiss Fundus Flash 3 Camera on Kodachrome 25 film and digitized at a third-party company. The digital images were captured on either a Zeiss Fundus Flash 3 with Escalon Digital Back or a Zeiss FF450 with Digital Back. All images were converted to the same image type (JPEG) when input into the model.
Prior to grading the images, standard reference photographs were chosen by three glaucoma specialists to define the range of image quality acceptable at each level (Figure 1). Using a web-based interface, images were sorted into three quality classes (good, acceptable, and poor) by one of three glaucoma specialists and subsequently used to develop the model. Our dataset contained 1002 good-quality images, 609 acceptable-quality images, and 766 poor-quality images.
A convolutional neural network was developed with Google Cloud AutoML, a code-free deep learning platform offered by Google. The AutoML platform allowed local data manipulation and subsequent upload of .csv files containing the cloud location of each image file and its grader-assigned image quality. The platform automates image preprocessing and the model training and selection process, including neural architecture search and hyperparameter tuning. To minimize overfitting, early stopping was enabled, which automatically terminated training when no more improvements could be made. Automated internal image preprocessing occurred when the image's smallest edge was greater than 1024 pixels. In this case, the entire image was scaled down so that the smallest edge was 1024 pixels. Images whose smallest edge was less than or equal to 1024 pixels were not subject to preprocessing.
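The downscaling rule described above can be expressed as a short function. This is only a sketch of the stated rule, not the platform's internal code; the function name is hypothetical, and it assumes the aspect ratio is preserved and that dimensions are rounded to the nearest pixel.

```python
def downscaled_size(width, height, max_smallest_edge=1024):
    """Return (width, height) after applying the described resizing rule.

    Assumption: the aspect ratio is preserved, so both edges are multiplied
    by the same factor. Images whose smallest edge is already at or below
    the limit are returned unchanged.
    """
    smallest = min(width, height)
    if smallest <= max_smallest_edge:
        return width, height
    scale = max_smallest_edge / smallest
    return round(width * scale), round(height * scale)
```

For example, a 3000 × 2000 pixel image would become 1536 × 1024 under these assumptions, while an 800 × 600 image would be left untouched.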
Of the 2377 image dataset, 80% (1897/2377) were used for training the model, 10% (242/2377) were used to validate the model, and 10% (263/2377) were used to test the model's performance. The same proportions (approximately 33% each) of good-, acceptable-, and poor-quality images were included in each dataset. The model's output was three prediction values, one corresponding to each quality level (good, acceptable, poor), ranging from 0-1 and summing to 1. For each image, the category with the highest score was taken as the model's predicted quality label.
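Taking the highest-scoring category as the predicted label amounts to an argmax over the three outputs. A minimal sketch follows, assuming the scores arrive as a dictionary keyed by label (the data structure and function name are hypothetical).

```python
LABELS = ("good", "acceptable", "poor")


def predicted_label(scores):
    """Return the quality label with the highest prediction value.

    `scores` maps each label to a value in [0, 1]; the three values are
    expected to sum to 1. Ties resolve to the first label in LABELS,
    which is an arbitrary choice made for this sketch.
    """
    total = sum(scores[label] for label in LABELS)
    if abs(total - 1.0) > 1e-6:
        raise ValueError("prediction values should sum to 1")
    return max(LABELS, key=lambda label: scores[label])
```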
We performed independent validation on a set of 292 randomly selected disc photographs that were not part of the 2377 images used to develop the model. In this round of validation, the model's performance was measured against a consensus-derived grade from three glaucoma specialists who evaluated the images independently. Consensus was defined as agreement between two of the three graders; there were no instances of disagreement between all three clinicians.
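The two-of-three consensus rule used for the independent validation set can be sketched as a simple majority vote. The function name and the choice to return `None` when all three graders disagree are assumptions; the study reports that no such case occurred.

```python
from collections import Counter


def consensus_grade(grades):
    """Return the label chosen by at least two of three graders, else None."""
    if len(grades) != 3:
        raise ValueError("expected exactly three grades")
    label, votes = Counter(grades).most_common(1)[0]
    return label if votes >= 2 else None
```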

Results
Table 1 provides a tabulation of the demographic characteristics of the patients represented in this dataset. Fifty-six percent were female, and the average age at the time of the disc photograph was 64 years. Patients were primarily Caucasian (54.6%); Asian and Black patients constituted 14.0% and 10.7% of the study population, respectively. Among the included eyes, 47.3% had a diagnosis of primary open angle glaucoma, and 22.3% were considered suspicious for glaucoma at the time of image capture.
Results are presented for both the ternary classification system described in the Methods section and the binary classification that results from combining the "good" and "acceptable" classes into one "usable" class, with the "unusable" class corresponding to the "poor" label of the three-class model. Figure 2 shows confusion matrices for both outcomes.
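The mapping from the three quality classes to the binary outcome, and the tallying behind a confusion matrix, can be sketched as follows. The helper names are hypothetical, and the sketch makes no assumption about whether the binary results were obtained by relabeling or by training a separate model.

```python
from collections import Counter

# "good" and "acceptable" are merged into "usable"; "poor" becomes "unusable".
TO_BINARY = {"good": "usable", "acceptable": "usable", "poor": "unusable"}


def to_binary(labels):
    """Collapse three-class quality labels into usable/unusable."""
    return [TO_BINARY[label] for label in labels]


def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs, i.e. confusion-matrix cells."""
    return Counter(zip(actual, predicted))
```

Note that a good/acceptable confusion in the three-class setting disappears after collapsing, which is one reason binary accuracy can exceed three-class accuracy on the same images.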

Ternary Classification Model
Full performance metrics (per-class and macro average) for the three-class model are shown in Table 2. AUC was calculated by using a one vs. rest approach, which splits the multi-classification task into several binary classification tasks, (e.g., good vs. [acceptable, poor]), then averages the AUC for each class to arrive at a single value. Macroaveraging, in which all classes get equal weight, was used because the three quality classes were relatively balanced. AUC for the model was 0.98, and the CNN had an overall accuracy of 91%. Although per-class performance varied between different metrics, generally the model performed better on classification of "good" and "poor" images than on images of "acceptable" quality. Examples of images misclassified by the model are shown in Figure 3. Notably, all misclassifications were no more than one category removed from their actual quality label (i.e., no "good" images mistaken for "poor" and vice versa).
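The one-vs-rest, macro-averaged AUC described above can be illustrated in plain Python. This sketch computes each binary AUC as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counted as one half); the function names and input layout are assumptions, and the platform's own implementation may differ.

```python
def binary_auc(scores, is_positive):
    """AUC as P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(score_rows, true_labels,
                  classes=("good", "acceptable", "poor")):
    """Average the one-vs-rest AUC over classes, giving each equal weight.

    `score_rows` is a list of dicts mapping class name -> prediction value.
    """
    aucs = []
    for cls in classes:
        scores = [row[cls] for row in score_rows]
        is_positive = [label == cls for label in true_labels]
        aucs.append(binary_auc(scores, is_positive))
    return sum(aucs) / len(aucs)
```

The pairwise comparison is quadratic in the number of images, which is acceptable for a test set of a few hundred images but would be replaced by a rank-based computation for larger sets.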

Binary Classification Model
Full classification metrics for the binary outcome model can be found in Table 3. Overall, performance was better on the binary outcome model than the three-class model, with an accuracy of 98.47%. Sensitivity and specificity were both high, at 98.91% and 97.47%, respectively. AUC was also high at 0.99. None of the collected demographic information varied significantly between the usable and unusable quality groups.
Females were 33% more likely to have images of poor quality than good quality (OR 1.33, 95% CI [1.10, 1.60]). There was no significant gender difference between the good- and acceptable-quality groups. On average, age at the time of disc photograph capture was significantly higher for images of acceptable (by 7.45 years, adjusted p < 0.01) and poor (by 7.71 years, adjusted p < 0.01) quality than for those of good quality.
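For reference, sensitivity, specificity, and accuracy for a binary outcome follow directly from the four cells of a confusion matrix. The sketch below uses hypothetical argument names, and which class counts as "positive" is an assumption left to the caller; the counts in the usage note are illustrative and are not taken from Figure 2 or Table 3.

```python
def binary_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, and accuracy from confusion counts.

    tp/fn: positive-class images predicted correctly/incorrectly.
    tn/fp: negative-class images predicted correctly/incorrectly.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }
```

With made-up counts such as `binary_metrics(tp=90, fn=10, tn=45, fp=5)`, all three metrics evaluate to 0.9.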

Independent Validation
Independent validation on 292 randomly selected images not included in the 2377-image dataset used to develop the model was performed to further evaluate the robustness of the model. Based on a three-clinician consensus grading, there were 225 good-, 53 acceptable-, and 14 poor-quality images in this dataset. This distribution is likely more representative of the entire optic disc photograph database.
The model agreed with the clinician consensus in 84.9% of cases, with a kappa value of 0.66. This corresponded to a 97.3% agreement with clinicians (kappa = 0.65) when considered as a binary outcome task, usable vs. unusable.
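The kappa statistic corrects raw agreement for the agreement expected by chance. A minimal sketch of unweighted Cohen's kappa is given below; whether the study used the unweighted or a weighted form is not stated, so the unweighted form here is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("inputs must be non-empty and of equal length")
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)
```

When one class dominates, as in the validation set where most images were usable, chance agreement is high, so a high raw agreement can coexist with a more moderate kappa.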

Discussion
The proposed system for automated IQA on optic disc fundus photographs achieves high accuracy for both the three-class task (90.84%) and the binary outcome task (98.47%), demonstrating that a CNN is able to provide a high degree of discrimination between images of good, acceptable, and poor quality. There is a wide breadth of literature concerning IQA and its utility in nonglaucoma pathologies, but there is comparatively little specific to glaucoma. We will be discussing both to contextualize our model and its performance.

Binary Approach: Non-Glaucomatous Fundus Photographs
In order to perform retinal IQA as a preliminary step in diabetic retinopathy screening, Saha et al. fine-tuned a pretrained AlexNet model to classify images from patients with diabetic retinopathy into the categories "accept" and "reject" [27]. AUC, sensitivity, specificity, and accuracy were all 100%. However, the authors excluded all "ambiguous" images, defined by a lack of agreement in image quality among the graders, from the training dataset. When these images were included, the model's accuracy dropped to 54.5%, highlighting a major limitation in the model's potential utility in a clinical setting. Similarly, Zago et al. adapted the Inception v3 architecture to the task of fundus IQA, again for eyes with retinal pathology [28]. The model's performance was evaluated by using interdataset cross-validation on two publicly available datasets, DRIMDB and ELSA-Brasil. The authors used image augmentation to synthesize poor-quality images, which were poorly represented in both datasets, and achieved a high sensitivity and specificity (97% and 100%, respectively, on DRIMDB; 92% and 96% on ELSA-Brasil) in the cross-validation. Another recent paper by Chalakkal et al. described a unique approach to IQA [29]. Their model first uses a deep learning classifier based only on image-quality markers such as sharpness and illumination, then performs a second round of unsupervised classification on images the initial classifier marked as good quality. Good quality in the second stage is defined by the presence of certain structural features of the retina, such as the optic disc, fovea, and macula. The authors report successful outcomes by using this approach, with an overall accuracy of 97.47%.

Binary Approach: Glaucomatous Optic Disc Photographs
Despite the success of the aforementioned models, they are not readily applicable to images of glaucomatous optic discs due to the domain-specific way in which image quality is defined. Diagnostically relevant pathological optic nerve head features such as vascular abnormalities, parapapillary atrophy, and disc hemorrhages can be difficult for a neural network to distinguish from poor-quality images or images with artifacts, and as such, it is important to have a model trained on images of glaucomatous discs [13][30][31][32]. Retinal IQA models traditionally focus on vessel visibility as the most important grading criterion, while IQA for imaging performed on glaucoma patients would be better served by an assessment of image quality in a region dominated by the optic disc [33]. There has been relatively little research performed specifically on this disease population, perhaps due in part to the scarcity of large, disease-specific, quality-labeled image databases [34].
Mahapatra et al. attempted to combat these gaps in large, publicly available datasets by first extracting 30,000 150 × 150 pixel patches from the 101 images in the DRISHTI glaucoma dataset, then using image augmentation (adding noise, altering contrast, etc.) to artificially generate poor-quality images, as the original dataset contained none [35]. They saw high accuracy (99.87%) by using these methods, and to the best of our knowledge were the first group to build a CNN-based IQA system specific to glaucomatous optic disc photographs. The next group to do so, Zapata et al., used 150,075 quality-labeled retinal images to train a model to discern good vs. poor image quality, with an overall accuracy of 92% [36]. It is unclear what portion of these images were from patients with glaucoma; 3776 of their total 306,302 fundus photographs were considered to have "referable glaucomatous optic neuropathy". The authors defined good-quality images as those centered on the macula with good focus, good visualization of the parafoveal vessels and at least two disc diameters around the fovea, and ability to visualize at least three quarters of the optic disc and the vascular arcades. This definition highlights our earlier point regarding disease-specific criteria for image quality: in our study, as well as in a 2020 study by Bhatkalkar et al., visualization of the entire optic disc was necessary for the graders to consider an image of good quality [37]. The lack of a standardized metric for defining image quality makes direct comparison between models difficult, a pervasive theme in the domain of IQA. Bhatkalkar's model, which was trained by using three publicly available image databases along with their own hospital database and was tested on three small public datasets, had an accuracy ranging from 96.4% to 100%. Notably, their model was trained on fundus photograph datasets of patients with diabetic retinopathy and age-related macular degeneration, and thus is not generalizable to IQA in patients with glaucomatous damage to the optic nerve. However, it does provide an indication of the success that may be achievable in such populations.
Our model achieved performance similar to that reported for other IQA systems. As all the reported glaucoma-specific IQA systems provide binary outputs, we will be directly comparing the performance of our binary model. Our binary classification model shows comparable accuracy to the models developed by Mahapatra, Zapata, and Bhatkalkar (98.5% vs. 99.87%, 92%, and 96.4-100%, respectively). We report high sensitivity (98.9%) and specificity (97.5%) values, which are on par with those reported by Mahapatra (100%, 99.8%) and Zapata (96.9%, 81.8%). These results are also consistent with the high level of success achieved by other groups developing similar models for IQA of patients afflicted with diabetic retinopathy, age-related macular degeneration, and other retinal pathologies.
Our study is unique for several reasons, most notably the size of our training dataset and the balanced sample sizes of the three quality classes. We trained, tested, and validated the model by using a total of 2377 images, which to the best of our knowledge is the largest dataset of quality-labeled glaucomatous disc photographs used in the development of such a model. Due to the scarcity of large databases of images from glaucomatous eyes, other studies have used smaller datasets, ranging in size from 99 to 397 images [35,37]. Moreover, the majority of publicly available datasets contain insufficient proportions of poor-quality images, which can result in lower predictive accuracy for the underrepresented class. As an extreme example, in the dataset used by Saha et al., only 143 out of the total 3577 images (4%) were of "reject" quality; given this distribution, a naïve model that only outputs "accept" will achieve 96% accuracy [27]. Some groups relied on image augmentation to artificially generate poor-quality images and overcome this limitation, but success obtained on synthetic images may not transfer to clinical practice [35]. Our model's performance also highlights the success achievable on a code-free machine learning platform, which can significantly reduce the in-house computational power and deep-learning expertise that is typically required when developing a neural network [38].
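The arithmetic behind the naïve-model example can be checked with a few lines of code. The function name is hypothetical; the counts are those quoted in the text above.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a model that always predicts the most common class."""
    return max(class_counts.values()) / sum(class_counts.values())


# Saha et al.: 143 of 3577 images were of "reject" quality.
saha = {"accept": 3577 - 143, "reject": 143}
print(f"{majority_baseline_accuracy(saha):.1%}")  # prints 96.0%

# Present dataset: 1002 good, 609 acceptable, and 766 poor images.
ours = {"good": 1002, "acceptable": 609, "poor": 766}
print(f"{majority_baseline_accuracy(ours):.1%}")  # prints 42.2%
```

The much lower baseline for a roughly balanced three-class dataset illustrates why accuracy is more informative when classes are balanced.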

Multiclass Approach
Although binary classification models like those previously described make up the majority of current IQA systems, recent literature has suggested that there may be a need for a more granular classification system with more than two levels of quality. We agree that a multiclass model addresses one of the primary limitations of a binary model, namely that it has difficulty classifying images of borderline quality [13]. By denoting an intermediate "acceptable" category, we attempted to keep such images from being mixed with the highest-quality images. There is a small amount of existing literature that relies on a multilevel grading system similar to our proposed three-class methodology. In a 2019 study by Fu et al., the authors define three levels of image quality: "good" (high-quality images with all retinopathy characteristics visible), "usable" (low contrast, blur, poor illumination, or artifacts, but main structures still identifiable by ophthalmologists), and "reject" (unreliable diagnostically) [25]. They report comparable accuracy (91.75%) to our ternary classification model. Other researchers have suggested parsing out image-quality categories even further to increase clinical utility. For example, Wang et al. define five levels of quality (adequate, just noticeable blur, incomplete optic disc, under/over-illumination, and opacity) [39]. By using a 121-layer, four-block DenseNet architecture, they applied transfer learning and fine-tuned the model to suit their five image-quality classes. Their model was relatively accurate (92.7% overall accuracy with per-class accuracy 77-100%) in discrimination between the four image degradation categories, but the wide range in per-class accuracy illustrates the challenges faced when shifting from binary to multiclass classification systems. The authors saw better outcomes in their binary classification model (good vs. poor, accuracy 97.2% and 98.2% on the two datasets used). We saw this reflected in our model's performance as well, with overall accuracy increasing when we adapted our three-level classification to a binary system.
Although our model demonstrated a relatively high level of accuracy in both the binary and multiclass forms, it is not without limitations. As previously mentioned, binary models have difficulty classifying images of borderline quality. This is not unique to our model, nor is it unique to neural networks. Image quality, even when precisely defined, is a subjective measure on which even trained ophthalmologists often disagree [27]. We attempted to address this with a three-class model in the hopes of isolating the borderline quality images, which investigators can choose to use or discard, depending on the task at hand. Additionally, for our model to be more useful in widespread population screening, it should be tested on images captured from a wider variety of cameras, particularly nonmydriatic fundus cameras, portable cameras, and ophthalmic lens attachments for smartphones. Additional validation, particularly on an external dataset, would better assess the robustness and generalizability of our model. However, there is a relative lack of public datasets of glaucomatous optic disc photographs, and an absolute lack of quality-labeled datasets, both of which limit our ability to externally evaluate the model's performance [34].
There is immediate potential for clinical and research utility of an IQA CNN such as ours. It has been suggested that models such as the one described in this paper can be integrated with the software of the camera capturing the fundus photographs to provide real-time feedback to photographers and allow for timely recapture of any unusable images [13]. Such an integration has the potential to improve clinical efficiency and is increasingly relevant with the advent of telemedicine in ophthalmology, where a trained ophthalmologist might not immediately be evaluating the images.
Although binary models may be more immediately applicable to clinical practice as they provide a simple outcome of diagnostically usable or unusable, a multiclass model may be better suited to certain tasks. For example, when examining serial disc photographs, image artifacts (media opacities, dust on the camera lens, patient eyelashes, lens flare, etc.) presumably present at a higher rate in the "acceptable" category may hinder accurate detection of real glaucomatous progression. Furthermore, as artificial intelligence for diagnostics and screening is increasingly the subject of ophthalmologic research, the development of large-scale fundus photograph databases is of paramount importance. Many studies have demonstrated that inclusion of low-quality images greatly decreases the accuracy of these diagnostic algorithms and limits their deployability into clinical practice [26]. A multiclass IQA network would directly benefit research efforts by allowing researchers to quickly select only appropriate quality images for use in training diagnostic models.
Future research should focus on further improving both binary and multiclass classification of optic disc IQA. Alternatively, the development of a model that outputs a continuous quality grade could allow even further control when selecting a threshold for inclusion as an image of "good" quality, offering the greatest flexibility for project-specific quality needs. Additionally, the development of a system capable of outputting the reason for assigning a "poor" quality grade (e.g., optic disc not completely in view, poor contrast, not focused, media opacity, etc.) could further enhance the clinical utility of an IQA algorithm and allow photographers to adjust their technique in real time.

Conclusions
Image quality is an important consideration when automating the process of disease diagnosis, which is a growing focus of glaucoma research. Poor-quality images are often excluded from datasets used to train machine learning models, a highly time-consuming process when the dataset size is on the order of hundreds of thousands. Moreover, when these diagnostic models are ultimately implemented into clinical practice, the presence of poor-quality images may limit their ability to correctly identify the presence of glaucomatous damage to the optic disc. Thus, a tool such as the proposed IQA model has clear applications to the development and implementation of AI-supported glaucoma diagnosis.
The proposed method classifies an optic disc photograph into one of three classes based on image quality: good, acceptable, and poor. We report outcomes for this multiclass model as well as a binary classification system where images of "good" and "acceptable" quality are all considered part of a "usable" image class. Our model, developed on a code-free deep learning platform, is highly accurate, agreeing with human graders 90.8% (three-class model) and 98.5% (binary model) of the time. These rates are comparable to several existing methods for fundus IQA and provide evidence not only for the feasibility and utility of automated IQA of glaucomatous disc photographs, but also for the effectiveness of a code-free platform usable by nonexperts for developing neural networks.

Institutional Review Board Statement: The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of the University of California, Los Angeles (UCLA) (IRB#19-001850).
Informed Consent Statement: Patient consent was waived because the research involves no more than minimal risk to the subjects and the research could not be practicably carried out without the waiver due to the large number of subjects included.