Validating the Generalizability of Ophthalmic Artificial Intelligence Models on Real-World Clinical Data

Purpose: This study investigates the generalizability of deep learning (DL) models trained on commonly used public fundus images to an instance of real-world data (RWD) for glaucoma diagnosis.

Methods: We used the Illinois Eye and Ear Infirmary fundus data set as an instance of RWD, in addition to six publicly available fundus data sets. We compared the performance of DL-trained models on public data and RWD for glaucoma classification and optic disc (OD) segmentation tasks. For each task, we created models trained on each data set, respectively, and each model was tested on both data sets. We further examined each model's decision-making process and learned embeddings for the glaucoma classification task.

Results: Using public data for the test set, public-trained models outperformed RWD-trained models in OD segmentation and glaucoma classification, with a mean intersection over union of 96.3% and a mean area under the receiver operating characteristic curve of 95.0%, respectively. Using the RWD test set, the performance of public models decreased by 8.0% and 18.4%, to 85.6% and 76.6%, for the OD segmentation and glaucoma classification tasks, respectively. RWD models outperformed public models on RWD test sets by 2.0% and 9.5% in the OD segmentation and glaucoma classification tasks, respectively.

Conclusions: DL models trained on commonly used public data have limited ability to generalize to RWD for classifying glaucoma. They perform similarly to RWD models for OD segmentation.

Translational Relevance: RWD is a potential solution for improving the generalizability of DL models and enabling clinical translation in the care of prevalent blinding ophthalmic conditions, such as glaucoma.


Data Specifications
The data specifications for the commonly studied publicly available fundus data sets, including RIGA, Drishti-GS, REFUGE, and RIM-ONE DL, as well as the instance of the RWD sample used in this study, are summarized in Table 1. Also, as discussed in the "Manual Labels for Glaucoma Classification by Experts" section of the manuscript, two glaucoma physicians manually labeled images in the test sets of the RWD and public data sets. Some images from the REFUGE, RIM-ONE DL, and RWD samples were labeled as nongradable or anonymous by the physicians, and this information is included in the "Image quality" column as "gradable/nongradable". However, since the physicians did not label the data from the Drishti-GS and RIGA data sets, the "Image quality" column for those two data sets is filled with "details unavailable".

Evaluation Metrics
We evaluated the segmentation results with the Intersection over Union (IoU), which measures the area of overlap between the predicted OD mask and the ground truth OD mask divided by their union area, as defined in Equation (1):

IoU = TP / (TP + FP + FN) (1)

Further, we evaluated the prediction results for glaucoma classification using accuracy (Acc), sensitivity (SEN), precision (PPV), and F1-score, as defined in Equations (2)-(5):

Acc = (TP + TN) / (TP + TN + FP + FN) (2)
SEN = TP / (TP + FN) (3)
PPV = TP / (TP + FP) (4)
F1 = 2 × PPV × SEN / (PPV + SEN) (5)

In Equation (1), true positive (TP) is the number of pixels belonging to both the ground truth mask and the predicted mask; false positive (FP) is the number of pixels classified in the predicted mask but excluded from the ground truth mask; and false negative (FN) is the number of pixels classified in the ground truth mask but excluded from the predicted mask. In Equations (2)-(5), TP (true positive) represents the positive cases (i.e., glaucoma) predicted to be positive, FN (false negative) represents the positive cases predicted to be negative (i.e., non-glaucoma), TN (true negative) represents the negative cases predicted to be negative, and FP (false positive) represents the negative cases predicted to be positive. We further evaluated the area under the receiver operating characteristic curve (AUROC), which shows the trade-off between SEN (or true positive rate) and the false positive rate (FPR), defined in Equation (6), across different decision thresholds (e.g., 0.1, 0.5):

FPR = FP / (FP + TN) (6)
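As a concrete illustration, the metrics above can be computed from binary masks and labels as in the following minimal NumPy sketch. This is not the study's own code; the function names are illustrative, and binary NumPy arrays are assumed as inputs.

```python
import numpy as np

def segmentation_iou(pred_mask, gt_mask):
    """IoU over binary masks: TP / (TP + FP + FN), as in Equation (1)."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()      # pixels in both masks
    fp = np.logical_and(pred, ~gt).sum()     # predicted but not in ground truth
    fn = np.logical_and(~pred, gt).sum()     # in ground truth but not predicted
    return tp / (tp + fp + fn)

def classification_metrics(y_true, y_pred):
    """Acc, SEN, PPV, F1, and FPR from binary labels, as in Equations (2)-(6)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)
    ppv = tp / (tp + fp)
    f1 = 2 * ppv * sen / (ppv + sen)
    fpr = fp / (fp + tn)
    return {"Acc": acc, "SEN": sen, "PPV": ppv, "F1": f1, "FPR": fpr}
```

AUROC itself requires continuous prediction scores rather than hard labels; in practice it is typically obtained by sweeping the decision threshold (e.g., with scikit-learn's `roc_auc_score`).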
The public images were collected from several hospitals and clinical studies of Chinese patients (details unavailable). The instance of RWD, acquired at the Illinois Eye and Ear Infirmary with a Zeiss CIRRUS photo 800 camera, spans a spectrum of glaucoma suspect, mild, moderate, and severe glaucoma.

Feature Spread and Similarity
Table 2 shows the feature spread (T-SNE covariance trace) and similarity (Wasserstein distance) for the union of public data sets and the RWD shown in Figure 1. The trace of the T-SNE covariance matrix for the union of public train and test sets is 0.086 and 0.089, respectively; for the RWD train and test sets, the values are 0.147 and 0.141. The Wasserstein distance between the public train and test sets is 0.003, and between the RWD train and test sets it is 0.005. The Wasserstein distance between the public and RWD train sets is 0.164, and between their test sets it is 0.202. In addition to the Table 2 results, which were generated with a pre-trained off-the-shelf ResNet-50 model, Table 3 summarizes similar results obtained with the public and RWD glaucoma classification models. The Wasserstein distance between the feature maps of the public train and test sets using the public model is 0.005, and the computed distance for the RWD train and test sets using the RWD model is 0.003. The Wasserstein distance between the public and RWD test sets is 0.140 using the public model and 0.060 using the RWD model.
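The feature-spread and similarity measures discussed above can be sketched as follows. This is a minimal illustration, not the study's code: it assumes feature matrices have already been extracted from a pre-trained ResNet-50, and it approximates the set-to-set Wasserstein distance by averaging the 1-D distance over feature dimensions, since the exact computation is not specified here. Function names and the embedding normalization are illustrative assumptions.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.manifold import TSNE

def tsne_covariance_trace(features, seed=0):
    """Feature spread: trace of the covariance matrix of a 2-D T-SNE embedding."""
    emb = TSNE(n_components=2, random_state=seed).fit_transform(features)
    emb = emb / np.abs(emb).max()  # scale embedding so traces are comparable across sets
    return np.trace(np.cov(emb, rowvar=False))

def feature_wasserstein(features_a, features_b):
    """Similarity between two feature sets: 1-D Wasserstein distance
    averaged over feature dimensions (one common simplification)."""
    return float(np.mean([
        wasserstein_distance(features_a[:, d], features_b[:, d])
        for d in range(features_a.shape[1])
    ]))
```

A larger covariance trace indicates a more spread-out (heterogeneous) feature distribution, consistent with the higher values reported for the RWD; a larger Wasserstein distance indicates a bigger distribution shift between two sets.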

IoU Per Test Image in OD Segmentation
To elaborate on the performance of each trained OD segmentation model, we examined the distribution of IoU across test images. Figure 2 (A) shows box plots of IoU per image using the public-public, public-RWD, RWD-RWD, and RWD-public models for segmenting the OD. We show the mean and median IoU per model as a red and a blue line per box plot, respectively. As shown in Figure 2 (A), we found that the public model has the lowest IoU variability (σ = 0.04) for segmenting the OD on the public data, but this variability increased by 350% when we tested the model on RWD (σ = 0.18). Since the public data is homogeneous, the public model performs nearly identically across the public test data, resulting in a low IoU variability. In contrast, since the RWD is heterogeneous, the public model does not perform identically across the RWD test set, resulting in a high IoU variability. Similarly, the IoU variability of the RWD model increases by 28% from σ = 0.07 on public data to σ = 0.09 on RWD.
The increase in IoU variability from public data to RWD using either the public- or RWD-trained model shows that the RWD is more challenging than the public data. Further, in Figure 2 (B) and (C), each row visualizes the original image, the ground truth OD mask, and the predicted OD mask when the IoU for an image is less than 0.5, to identify challenging images in the public-RWD, public-public, RWD-RWD, and RWD-public experiments, respectively. We found that all images in the public-public experiment have an IoU greater than 0.5. As indicated in Figure 2 (B), images with multiple bright spots or with small or dark OD regions in the RWD test set are challenging for the public model to segment accurately (IoU < 0.5). In contrast, as indicated in Figure 2 (C), the RWD model yields an IoU less than 0.5 for only 0.02% of images in the RWD test set.
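The per-image analysis above (box-plot statistics plus flagging of challenging images with IoU < 0.5) can be sketched as follows. This is a hypothetical helper, not the study's code: it assumes lists of binary predicted and ground truth masks, and the names and threshold default are illustrative.

```python
import numpy as np

def iou(pred, gt):
    """Per-image IoU between two binary masks: |intersection| / |union|."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union

def summarize_ious(pred_masks, gt_masks, threshold=0.5):
    """Box-plot style summary of per-image IoU, plus indices of
    challenging images (IoU below the threshold) for visual inspection."""
    ious = np.array([iou(p, g) for p, g in zip(pred_masks, gt_masks)])
    hard = np.flatnonzero(ious < threshold)  # candidates to plot with their masks
    return {"mean": ious.mean(), "median": np.median(ious),
            "std": ious.std(), "challenging": hard}
```

The `std` entry corresponds to the σ variability quoted per experiment, and the `challenging` indices identify which rows to render as original image / ground truth mask / predicted mask triplets.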

Generalization on A Larger Sample
We further tested the RWD- and public-trained glaucoma classification models on a larger unseen sample of our RWD consisting of 5,275 images. We summarized the comparison results for the public- versus RWD-trained models on the 5,275 images in Table 4. As indicated, the RWD-trained model outperforms the public-trained model by 15% higher accuracy, showing that it generalizes substantially better to the ~5.2K unseen RWD images. Similarly, the RWD-trained model achieves higher SEN, PPV, and F1 score than the public-trained model when tested on the ~5.2K unseen RWD images.

T-SNE Visualization
In Figure 4 (A)-(D), we show the 2D projection of the features learned by the public-public, RWD-RWD, public-RWD, and RWD-public models, respectively. The T-SNE results in parts (A) and (C) of Figure 4 are generated with the public model, and the T-SNE results in parts (B) and (D) are generated with the RWD model. Projected glaucoma and non-glaucoma features per test image are shown as red and blue circles, respectively. We found a clear cluster pattern separating glaucoma and non-glaucoma features when we used the public-trained model on the public test data, as shown in Figure 4 (A). However, this cluster pattern disappeared when we tested the public-trained model on RWD, as shown in Figure 4 (C). Therefore, our results suggest that while the public-trained model can clearly discriminate between glaucoma and non-glaucoma images in public test data, it cannot maintain this performance on the RWD. In contrast, based on Figure 4 (B) and (D), the RWD-trained model shows a similar pattern in discriminating between glaucoma and non-glaucoma images across data sets (public, RWD).

Figure 1.
ResNet-50 feature visualization for classification data (cropped fundus images) via T-SNE. T-SNE results shown as original input images for (A) train sets and (B) test sets. T-SNE results shown as scatter points for (C) train sets and (D) test sets. In parts (A) and (C), the gray color represents test data on the train set plots; conversely, in parts (B) and (D), the gray color represents train data on the test set plots.

Figure 2.
Inspection of Intersection over Union (IoU) for the public-public, RWD-RWD, public-RWD, and RWD-public experiments. (A) Box plots of IoU per test image for the public-RWD, public-public, RWD-RWD, and RWD-public experiments. (B), (C) Challenging images that resulted in low OD segmentation performance. Each row respectively contains the original images, ground truth masks, and predicted masks with an IoU < 0.5.

Table 1.
Data specifications for the commonly used publicly available data and the instance of real-world data (RWD) used in this study.

Table 2.
Comparison of feature spread (T-SNE covariance trace) and similarity (Wasserstein distance) between the union of public data sets and real-world data (RWD) using an off-the-shelf ResNet-50 model. The subscript "tr" stands for the train set and "te" stands for the test set.

Table 4.
Comparison results of public- versus real-world data (RWD)-trained models tested on 5,275 unseen samples of our RWD for glaucoma classification.