Severity Classification of Conjunctival Hyperaemia by Deep Neural Network Ensembles

Conjunctival hyperaemia is a common clinical ophthalmological finding and can be a symptom of various ocular disorders. Although several severity classification criteria have been proposed, none include objective severity criteria. Neural networks and deep learning have been utilised in ophthalmology, but not for the purpose of classifying the severity of conjunctival hyperaemia objectively. To develop a conjunctival hyperaemia grading software, we used 3700 images as the training data and 923 images as the validation test data. We trained the nine neural network models and validated the performance of these networks. We finally chose the best combination of these networks. The DenseNet201 model was the best individual model. The combination of the DenseNet201, DenseNet121, VGG19, and ResNet50 were the best model. The correlation between the multimodel responses, and the vessel-area occupied was 0.737 (p < 0.01). This system could be as accurate and comprehensive as specialists but would be significantly faster and consistent with objective values.


Introduction
Conjunctival hyperaemia is one of the most common findings in ophthalmologic practice. It is routinely described as a symptom of many ocular diseases such as conjunctivitis, uveitis, elevated intraocular pressure due to glaucoma, and ophthalmic side effects. For example, conjunctival hyperaemia is a minor side effect of glaucoma eye drops, but it becomes relatively important when medication adherence is considered. Most complaints of eye drop-related conjunctival hyperaemia are regarding aesthetics, but patients' dislike of their eyes' appearance can significantly impact their need to continue their medication [1,2]. Several clinical studies have been conducted to assess conjunctival hyperaemia reactions after glaucoma eye drop instillation [3,4], but a critical variable in these studies is the determination of the severity of conjunctival hyperaemia.
At present, conjunctival hyperaemia is classified by severity according to the McMonnies and Chapman-Davies scale [5], Institute for Eye Research scale [6], Efron scale [7], a validated bulbar redness scale [8], and the Japan Ocular Allergy Society (JOAS) conjunctival hyperaemia severity grading scale [9]. However, all of these grading systems are purely subjective [10]. In the aforementioned clinical studies, the JOAS system was used; in it, clinicians use standardised photographs to grade the degree of dilation of the conjunctival blood vessels causing hyperaemia on a 4-point scale that includes no hyperaemia. is severity grading is used in clinical studies of the aforementioned glaucoma eye drops [3,4]. Yoneda et al. invented an analytical application dedicated to conjunctival imaging to establish an objective grading system [11,12]. In their application, the area occupied by the blood vessels is obtained from images captured by a dedicated conjunctival imaging system. However, Yoneda admits that it is necessary to simplify the application before it can be used in clinical practice [11].
Recently, a supervised machine learning system known as neural network [13] and its algorithms are gaining attention. In particular, in medical research, the deep neural network, which uses many convolution layers [14], has been applied. In ophthalmology, its use has been validated in reports on diabetic retinopathy, glaucoma, age-related macular degeneration, and retinal detachment [15][16][17][18][19]. e imaging devices used to train the machines are also diverse, including a fundus camera, an optical coherence tomographic system, and a wide-angle fundus camera. e advantage of diagnostic and judgement systems using deep learning is the range of their adaptability. For example, using convolutional layers, features can be grasped without the effects of slight noise [20][21][22]. In addition, although a large amount of computation is required for the learning process, actual grading is performed by a simplified four-rule computation.
us, a large computing capacity is ultimately unnecessary, and even a small device can be used for verification [23].
Although a clinically useful system that automatically performs hyperaemia grading by deep learning is theoretically possible, to our knowledge, it has not been attempted yet.
Here, we attempted to develop a system that performs as well as ophthalmology specialists using standard slit photographs to teach a deep neural network the conjunctival hyperaemia severity grading of the JOAS.

Materials and Methods
e Japan Ocular Allergy Society's conjunctival hyperaemia severity grading system (hereafter "JOAS grading") 9 is a system to classify the degree of dilation of conjunctival blood vessels in spherical conjunctiva into four levels: none, mild, moderate, and advanced, using a set of standard photographs ( Figure 1). is study was performed in accordance with the Declaration of Helsinki. Study protocol and conduct were approved by the Institutional Review Board of Kochi University and Saneikai Tsukazaki Hospital.

Images to Be
Analysed. Of all slit lamp photographs taken for clinical purposes at Ophthalmology Department of Tsukazaki Hospital between 01/15/2005 and 07/14/2018, a total of 5,008 photographs were extracted. To make them consistent with the standard JOAS photographs, magnifications of 5× and 8× were used. Slit lamp microscopes by Zeiss Corporation and Hague Straight Corporation were used; the photography conditions such as the amount of light and direction of gaze were not specifically defined. Photographers varied as well.
ere were no particular inclusion criteria in terms of causative diseases. e patients who have subconjunctival hemorrhage were excluded. Also, images taken after ocular fluorescein staining were included in the analysis.
Excluded from the analysis were all images taken through a cobalt or blue-free philtre. e images not taken under generalised illumination were also excluded. e study was conducted in accordance with the tenets of the Declaration of Helsinki. Study protocol and conduct were approved by the Institutional Review Board of Kochi University and Saneikai Tsukazaki Hospital.

Image Data.
e initial 5,008 images were divided into two groups: 4,008 images for preparing the artificial intelligence model (hereafter "for training") and 1,000 images for preliminary validation by graders and for model validation (hereafter "for validation"). An overview of the data flow for training and subsequent validation is provided in Figure 2; details will be described in the appropriate sections below.

Selection of Graders.
In this study, the quadraticweighted kappa coefficient [24] was used to first examine interrater agreement for the 1,000 validation images. is allowed us to determine the quality of responses and evaluate performance by excluding coincidence rates (chance positive results due to data distribution) [24]. JOAS grading was performed by a physician who was a specialist member of the Japanese Society of Allergology and the Japanese Ophthalmological Society (hereafter the "specialist"), as well as four certified orthoptists (COs). e 1,000 validation images were graded individually and completely independently (i.e., no consultations among graders). Five images were excluded due to mistakes, and those images considered to be ungradable by at least one grader were excluded when calculating the weighted kappa coefficient. As a result, a total of 881 images were included in this analysis.
As shown in Table 1, all 4 COs and the specialist graded with weighted kappa coefficients of above 0.7; therefore, they were considered grading experts (hereafter "the experts") for the purposes of reference (correct) JOAS grading scores during training.

Training Data.
e 4,008 training images were randomly divided into two sets of 2,004 images. Two COs (A and B or C and D) graded one image set each using JOAS grading. Images considered to be ungradable by either of the two during the grading process were excluded from analysis. In addition, some images were lost due to data management errors. A total of 200 images were lost from the full training image set. For the remaining 3,808 images, both graders were in agreement for 2,621 images; this set was then used for the training data. e remaining 1,187 images which were inconsistent in grades were graded again by the specialist, who reinstated a total of 1,079 of the inconsistent images. As a result, a total of 3,700 images were included in the training data.

Validation Data.
One thousand images were randomly divided into two sets of 500 images to be used for validation of the system. For each set of 500 images, responses used for the selection of graders were adopted as experts' responses to be used for performance evaluation of the artificial intelligence model. e images with consistent responses were adopted as experts' responses (there were no consistent responses for ungradable images), and some of the images with inconsistent responses were excluded from analysis by the specialist. For those images with inconsistent answers, if they were not excluded from analysis, the experts' responses were adopted. For validation data, 454 images were included for CO A and B and 469 images for CO C and D. ere were 923 images in total.

Building of the Artificial Intelligence Model.
e grade classification of the training data was as follows: Grade 0, 688 images; Grade 1, 1734; Grade 2, 1176, and Grade 3, 102. Image processing was performed on the training data to amplify images as follows: Grade 0, 4 times; Grade 1 and 2, doubled; Grade 3, 18 times. Doing this allowed us to have smaller differences on the number of training images after data expansion between grades. In the image amplification processing, inversion was always carried out. Other processing performed included no corrections, contrast adjustment (increase or decrease), c correction (c � 0.75 or 1.5), histogram equalisation, Gaussian noise addition, and salt and pepper noise addition. Of the nine types of processing employed, the types of processing to be performed on each grade were randomly chosen: 2 types for Grade 0, 1 type for Grade 1, 1 type for Grade 2, and 9 types for Grade 3. With the amplified images, we trained nine types of network structures (VGG16, VGG19, ResNet50, InceptionV3, InceptionResNetV2, Xception, DenseNet121, DenseNet169, and DenseNet201) [25][26][27][28][29][30] Overall 5008 Validation 1000  Figure 2: Image data flow. e top branch represents data flow for creating an artificial intelligence model. e bottom branch represents data flow for preliminary grader evaluation and model evaluation. CO was a certified orthoptist (expert grader); Dr was a doctor who is a specialist in both the Japanese Society of Allergology and the Japanese Ophthalmological Society (specialist grader). e data flow processes for training data and evaluation data were different because defining correct responses required different protocols for each process.  and built nine models. Using each model, we tested the validation data and evaluated the model.

Deep Learning Model and Training.
e VGG16 network structure can be divided into five binding blocks composed of a convolution layer and max pooling layer, as well as the fully connected layer [25].
First, all input images were converted to 256 × 192 pixels in advance, read in 8-bit RGB colour and 256 pixels * 192 pixels * 3 channels tensors. e input was normalised to the range of 0-1 by dividing it by 255. e convolutional layer recognises features of the target through convolutional filters [20][21][22]. e max pooling layer was placed at the end of each block; it reduces the position sensitivity of a feature output from a convolutional layer so that a more general recognition can be performed [31].
Finally, after flattening the three-dimensional matrix, we arranged two layers of the fully connected layer and classified them into four classes by a softmax function. e purpose of the fully connected layer is to remove spatial information from extracted features and to statistically distinguish the target from other feature vectors [32]. Dropout processing was applied to the first fully connected layer to mask out with 50% probability. e purpose of dropout processing is to improve the generalisation performance and prevent overlearning during the training [33].
As an output, the probability distribution for the outcome of the sum becoming 1 is displayed, making the item with the largest value as the output grade.
We used a method called fine-tuning, which uses already-learned parameters with different data. Its objective is to increase the training speed and easily obtain high performance even with a small amount of data [34]. e parameters obtained from learning Imagenet were used as initial values of the parameters for layers other than the fully connected layer, and the training was conducted to obtain appropriate parameters from the initial values. e initial weight update was performed according to an optimisation algorithm called Momentum SGD (learning coefficient � 0.001, inertia term � 0.9), which is one of the stochastic gradient descent methods [35,36]. Categorical cross-entropy was used for a loss function.
Also, each grade was given a different weight for the loss function. Table 2 lists the weight for each grade.
Fine-tuning was performed on other network structures as well. For layers except the fully connected layer, we used parameters by learning Imagenet as initial values, and learning and validation of classification were done using two fully connected layers and a dropout layer. e optimiser and loss function are the same as in the case of VGG16. e construction and validation of the model were carried out using Keras (https://keras.io/en/) which runs Python's TensorFlow backend (https://www.tensorflow.org/). e training and validation of the model were done using the GeForce GTX 1080 Ti GPU by NVIDIA.

Performance Evaluation in a Single Model.
Performance evaluation in each model was performed using the weighted kappa coefficient.
Of all the validation data, we set the weighted kappa coefficients of CO A and CO B as κ ab for data using the responses by CO A and B (κ na and κ nb , respectively). Likewise, CO C and CO D were set as κ cd , κ nc , and κ nd . Here, we set and calculated the evaluation index called kappa distance score (KDS) to find how close the responses of the model were to those of humans. (1)

Comparison of Kappa Coefficients between Models.
For the total of 923 validation images, the weighted kappa coefficients of each model were compared to examine whether any of the models provided a different response. In comparison with other models, we excluded those with an average of weighted kappa coefficient of 0.7 or less and those with the lowest average of weighted kappa coefficients. is was done until there were no more models with an average of weighted kappa coefficients of 0.7 or less in comparison with other models, and only the remaining models were used as the models for performance evaluation in a multimodel.

Performance Evaluation in a Multimodel State.
When using a multimodel, there are two ways to set which of the neural network's responses are to be used: (1) If the responses of more than half of the models of N (number) models match with the expert's responses, the responses are considered the same; otherwise, the grade with the highest number of models' responses are considered as the responses of the multimodel. Based on the responses of (1), the score obtained by the method described in the performance evaluation of a single model was set as KDS (n,half ) and responses obtained in (2) were set as KDS (n,least) .
Seven out of nine models were used for validation. In the order of the highest KDS when performing the single-model performance evaluation for each of the seven models, KDS (n,half ) and KDS (n,least) at N � 2 to 7 were calculated using N models.
Also, KDS (n,half ) and KDS (n,least) values were normalised so that the average would equal 0 and the standard deviation would equal 1, and the total value was set as KDS (n,multi) as shown below: (2)

Quantitative Evaluation.
We compared the responses obtained by the software to the image area occupied by blood vessels in the bulbar conjunctiva as developed by Yoneda et al. [11,12] with the responses by the multimodel. e evaluation screen is shown in Figure 3.
To obtain the multimodel's response, grades provided by the 6-model multimodel were averaged and rounded. Except for Grade 3, images were randomly selected.
We investigated whether there is a correlation between the response of the multimodel and the area occupied, as well as whether there is a significant difference in the area occupied by blood vessels per each response grade of the multimodel.

Statistical Analysis.
We used a Python module called scikit-learn to calculate the weighted kappa coefficients. Although the confidence interval can be obtained from approximate standard error for the weighted kappa coefficients [24], there is no established calculation method for confidence intervals such as KDS using multiple weighted kappa coefficients. erefore, there is no established way to test whether the score exceeds zero.
It is also difficult to examine whether there is a significant difference in score per model. For the above reasons, calculation of confidence intervals and statistical test were not performed for each statistic.
Spearman's rank-correlation test [37] was performed to examine the correlation between Yoneda's software and the multimodel. e Kruskal-Wallis [38] and Steel-Dwass [39] tests were conducted to investigate whether there was a significant difference in the area occupied by blood vessels of each response grade of the multimodel.

Results and Discussion
For each model, we trained the neural network to grade conjunctival hyperaemia using the JOAS system with 3,700 images. We then used 923 other images as validation test images to evaluate how well the model could grade. e average age and female ratio of training images and validation test images were 50.6 (±21.2) year and 57.0%. We calculated weighted kappa coefficients (κ) as interrater reliability measures for the experts (Table 1) and then for each of the models (Table 3). We then used these values to calculate a kappa distance score (KDS) for each model, as well as for a multimodel system in which two or more of the seven best models were combined to give a single final output (i.e., each model in the multimodel got a "vote," and the votes were tallied for the "winning" grade ouptut). e KDS represents how close the model's responses matched the expert clinicians' responses, such that higher values are a closer match and values above zero are considered clinically acceptable. Table 3 shows κ scores and KDSs for each model. We found that the DenseNet201 model was the best individual model (i.e., highest KDS); however, no individual model reached a KDS above zero.

Single-Model Evaluation.
Evaluation of intermodel kappas for multimodel inclusion.
e weighted kappa coefficients between models were also calculated. e average kappa coefficients of the InceptionResNetV2 model and Xception model were below the acceptable threshold of 0.7 and were the two lowest ( Supplementary Tables 1 and 2). us, they were excluded and the following seven models (Supplementary Table 3) were evaluated for performance in a multimodel system: DenseNet201, DenseNet121, VGG19, DenseNet169, VGG16, ResNet50, and InceptionV3. Figure 4 shows the three KDS values assessed for each multimodel system (n � 2 to 7 included models): KDS (n,half ) , KDS (n,least) , and KDS (n,multi) . Of these, KDS (n,half ) represents diagnostic accuracy (the ability to score the correct grade), KDS (n,least) represents diagnostic completeness (the ability to provide at least one correct response when faced with several correct options), and KDS (n,multi) represents the net KDS score (how close the system was to the experts). KDS (n,half ) was above zero for all systems, indicating that when combined, the individual models can achieve clinically acceptable diagnostic accuracy. KDS (n,multi) was highest for the 6-model system, suggesting that combining the DenseNet201, DenseNet121, VGG19, DenseNet169, VGG16, and ResNet50 models produces the best overall outcome (Supplemental Tables 4). An example of a response provided by an actual model is shown in Figure 5. It took 411.0 seconds to generate the graph of the responses for the 923 test images, which translates to a scoring speed of 0.445 seconds per image. Figure 6 shows the multimodel's responses, and the area occupied by blood vessels measured with Yoneda et al.'s software (Supplementary Table 5). eir software was able to measure the area occupied by blood vessels for 71.8% of all images. e correlation between the multimodel responses and the vessel-area occupied was 0.737 (p < 0.01). In addition, significant differences were found between each pair of grades when comparing the measured areas occupied by grade, as well as between grade pairs when comparing the multimodel responses (p < 0.01). is suggests that both the software and the multimodel system can distinguish clearly between grades and that the area of an image occupied by blood vessels can be a quantitative marker for each grade.

Discussion
In this study, we developed a deep learning system that grades the severity of hyperaemia with a high degree of consistency with expert graders and can do so with objective criteria (image area occupied by blood vessels). Hyperaemia grading can have different responses. When using the neural network, or supervised learning, it is necessary to teach input (image) and output (grading) at the same time [13].   erefore, for training purposes, it is necessary to know the correct response for each image, so we used the experts' responses as the correct responses for images when they agreed and the specialist's response when the experts disagreed. For validation, however, it is not required to teach the correct response, so the experts' responses were used to assess responses. us, the data flow differed between the validation data and the training data systems. KDS (n,half ) represents the accuracy of diagnosis, and KDS (n,least) represents the completeness of diagnosis. In other words, a diagnosis would be inaccurate if a system provided random responses across grades; that would be clinically unacceptable. At the same time, when various diagnoses (correct options) are possible, the system could provide responses that do not include any correct response grades, which would be problematic in terms of risk management. e grading by experts can halve this problem. Because KDS (n,half ) was greater than 0 in the 6-model multimodel system, it can be considered to have clinically acceptable accuracy. Because the KDS (n,multi) score was the highest, it is  : Comparison of responses of multimodel and the area occupied by blood vessels. ere is a correlation between the responses of the multimodel system and the area occupied by blood vessels. Also, there is a significant difference in the area occupied by blood vessels between all grades. also considered the most comprehensive diagnostic model among all the combinations of models examined in this study. KDS (n,half ) is markedly lower in the n � 3, 5, and 7 systems. is is because when n is an even number, "more than half" of the models have to provide a correct response to score as correct; on the other hand, when n is an odd number, the "majority" of the models have to provide a correct response to score as correct. at is, in oddnumbered combinations, one more model must be in agreement than in their even-numbered (i.e., one less model/ combination) counterparts; this reduces the likelihood of success, which is represented by a lower KDS.
JOAS grading is used as a subjective indicator by doctors in both research and clinical practice [9]. On the other hand, the area occupied by blood vessels in an image has been applied by Yoneda et al. [11] as an objective indicator in research studies.
e multimodel system created in this study was highly consistent with both the subjective and objective indicators. One might think that it would be best to use the area occupied by blood vessels as an objective indicator in clinical practice, but there is a drawback to doing so. Continuous values (those represented by a range of numbers rather than discreet categories) are rarely used as a basis for decision-making. Far more often, one or more threshold values are set to perform categorical classification. Because the vessel-occupied area is continuous, subjective thresholds must be set, reducing the objectivity of the value. us, it is as meaningful (or more so) for a neural network to use the categorical value, grading, instead of measuring the area occupied by blood vessels. Given these conditions, we believe that our multimodel system, particularly the 6-model system, matches the subjective performance of JOAS grading by clinicians and is, therefore, a clinically relevant model.
One of the strengths of this multimodel system is that it was created using images acquired in routine clinical practice. us, the system does not require imaging methods specific for grading hyperaemia or special imaging devices; instead, it can use images acquired using a standard slit lamp microscope with adjustable angle and magnification, suggesting that it is highly suitable for routine clinical application. For example, a patient being seen for corneal foreign bodies (e.g., pieces of metal) could have associated hyperaemia automatically graded at the same time. is would allow the ophthalmologist to evaluate improvement of the hyperaemia at the time of follow-up for the corneal inclusions. With our system, it is possible to improve the quality of care easily with software and without adding any special equipment. In fact, the software used to measure the area occupied by blood vessels was able to measure only about 70% of all the data used in this study.
ere are several limitations to our prototype system. First, our study results suggest that it would be necessary to clearly illustrate the neural networks' focus area in clinical practice. At present, the neural network should only be used to assist physicians in making a final decision, instead of allowing it to make an independent diagnosis. erefore, it is important to include a function in the system wherein the relevant part of the reasoning used by the neural network is directly communicated to the physician; in fact, a few recent medical papers have addressed one such form of reasoning, segmentation [40,41]. In our system, however, clinically produced hyperaemia images are used, and these will be seen by the physician as part of his or her examination. is obviates the need for that communicative function. Second, this study is a retrospective data search within a single facility, and it is necessary to evaluate the robustness of the model by conducting a prospective study on data from one or more other facilities.

Conclusions
In this study, we developed an artificial intelligence-based grading system that was accurate and highly consistent with grading by clinical experts. We would like to develop this system further by improving the problematic aspects mentioned above, utilising it in an actual clinical setting and adding the necessary functions to allow it to be applied in much more widespread applications.

Data Availability
e clinical data used for the training and validation sets were collected at Tsukazaki Hospital. ey are not publicly available as restrictions apply to their use. e data are available with corresponding author on reasonable request.

Conflicts of Interest
e authors declare that there are no conflicts of interest regarding the publication of this paper.