The goal of this study was to systematically assess the agreement of individual human grading experts in evaluating both qualitative and quantitative OCT markers in a large, representative set of eyes affected by nAMD, DME or macular edema due to RVO, a task routinely performed in both clinical and trial settings. As the presence of retinal fluid impacts functional loss as well as treatment decisions in all three diseases, fluid markers were the primary focus of this study. Although IRF and SRF might appear similar in nAMD, DME and RVO, no study has compared their actual gradability between the three conditions; interestingly, our results show a clear difference in the agreement on IRF presence between nAMD (kappa 0.69), DME (0.64) and RVO (0.86). Such levels of disagreement among certified human experts are surprising, especially in an optimized setting featuring standardized image acquisition following defined study protocols and a user-friendly platform for professional image grading.
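The kappa values above are chance-corrected agreement statistics rather than raw percent agreement. A minimal sketch of how Cohen's kappa could be computed for two graders' binary IRF labels (the grader labels below are purely hypothetical, not the study's data):

```python
# Illustrative sketch: Cohen's kappa for two graders' binary labels
# (hypothetical data; not the study's actual grading records).

def cohens_kappa(a, b):
    """Chance-corrected agreement between two binary label lists."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of scans both graders labelled identically.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each grader's marginals.
    pa1 = sum(a) / n
    pb1 = sum(b) / n
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return (po - pe) / (1 - pe)

# Hypothetical IRF presence labels (1 = fluid present) from two graders:
grader1 = [1, 1, 0, 0, 1, 0, 1, 0]
grader2 = [1, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(grader1, grader2), 2))  # → 0.5
```

On the conventional Landis and Koch scale, values of 0.61-0.80 indicate substantial and values above 0.80 almost-perfect agreement, which is why the gap between 0.64 (DME) and 0.86 (RVO) is notable.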
In RVO, the higher relative agreements on IRF presence are consistent with previously published TD-OCT results of 76%-83%[19] and 84%[20]. Typically, large cystoid spaces in a substantially thickened retina affect the ganglion cell layer as well as the inner and outer nuclear layers and are located sub- or parafoveally, making them more easily identifiable overall.[26] This is in contrast to the often subtle hyporeflective spaces in nAMD that might be confused with “pixel voids”, a term describing a cyst-like appearance of hyporeflectivity in hyporeflective retinal layers caused by low signal intensity in the absence of actual cystoid changes.[1, 2] The reduced consistency in grading IRF in nAMD might be due to its association with numerous other structural alterations, many of them degenerative in nature, whereas in RVO, IRF occurs as an acute accumulation of fluid in an otherwise unaltered retina. Depending on the underlying lesion type in nAMD, there is additional variability in retinal fluid localisation and extension.[27] In a post-hoc analysis of 270 TD-OCT scans from the CATT, DeCroos et al. assessed the reproducibility between two independent grading teams and found kappas of 0.48, 0.80 and 0.75 for the detection of IRF, SRF and sub-RPE fluid, respectively.[2] In another retrospective case series of AMD eyes, four independently trained retina specialists graded 112 SD-OCT images and reached agreements of 0.62, 0.82 and 0.60 for the detection of IRF, SRF and PED, respectively, results similar to those in our AMD cohort.[9] When comparing the detection of macular fluid between ophthalmologists and an RC, the major causes of disagreement were found to be thinner retinas, smaller fluid pockets and a greater decrease of retinal thickness at the foveal centre.[1] Keenan et al. confirmed these findings in a recent follow-on study of the AREDS 2 trial.
The study compared the performance of retina specialists in assessing retinal fluid in SD-OCT images to that of a deep learning-based algorithm and reported an accuracy of 0.81, a sensitivity of 0.47 and a specificity of 0.97. IRF was significantly more often missed by the graders when it appeared in the absence of SRF, or when the mean retinal fluid volume and the number of B-scans showing fluid were lower.[7] One may assume that the same factors also complicate the grading of IRF in DME.[28] Although comparable to nAMD, a kappa of 0.64 for the detection of IRF in DME was nevertheless surprising. The lower reproducibility might also be due to small focal oedemas off the macular centre being more easily missed, especially when overshadowed by hard exudates.
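For context, accuracy, sensitivity and specificity as quoted from Keenan et al. are all derived from a 2x2 confusion matrix of grader calls versus the reference standard. A minimal sketch with purely hypothetical counts (not Keenan et al.'s data):

```python
# Minimal sketch: deriving accuracy, sensitivity and specificity from the
# four cells of a 2x2 confusion matrix (hypothetical counts).

def diagnostic_metrics(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)                 # true-positive rate
    specificity = tn / (tn + fp)                 # true-negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # overall correct fraction
    return sensitivity, specificity, accuracy

# Hypothetical: 200 scans, 60 with fluid (28 detected), 140 without (136 correct).
sens, spec, acc = diagnostic_metrics(tp=28, fn=32, tn=136, fp=4)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

A pattern like this illustrates how a grader can combine high accuracy and specificity with low sensitivity when the positive class (fluid present) is the minority and is often missed.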
In contrast to the difficult task of IRF assessment, grading in our study was more consistent for SRF (nAMD: 0.80; DME: 0.83; RVO: 0.89). This is not unexpected for a feature that occupies an anatomically predefined compartment of the retina and therefore shows less variability in appearance. In nAMD, however, the association with (heterogeneous) hyperreflective material in the subretinal compartment as well as outer retinal degeneration might complicate the grading of SRF. Nonetheless, various studies have produced comparable results for the detection of SRF in nAMD, with kappas ranging from 0.72 to 0.82.[2, 9, 10, 12] While less comparable results were found in a small-scale DME study[28], there is so far no literature on OCT grading agreement of SRF in RVO.
As in previous studies[2, 10-12, 15, 19], measuring CST reached excellent agreement between our graders (ICC: 1.0), independent of the disease. While CST is still used as an anatomical outcome measure in clinical trials, it is neither a reliable indicator of disease activity nor does it show a meaningful correlation with visual function over time.[29-33] Pawloff et al. applied a precise AI-based fluid quantification algorithm to more than 2400 eyes to assess the correlation between (three-dimensional) retinal fluid volumes and (two-dimensional) central retinal subfield thickness, demonstrating a surprisingly low correlation of r=0.57 in nAMD.[32]
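The ICC quoted for CST summarizes agreement on a continuous measurement across multiple graders. One common choice is the two-way random-effects, absolute-agreement, single-rater form, ICC(2,1); a minimal numpy sketch under that assumption, using hypothetical CST values (not the study's measurements):

```python
import numpy as np

def icc2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    x: (n_subjects, k_raters) matrix of measurements."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-subject means
    col_means = x.mean(axis=0)   # per-rater means
    # Mean squares from the two-way ANOVA decomposition.
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # raters
    sse = (np.sum((x - grand) ** 2)
           - k * np.sum((row_means - grand) ** 2)
           - n * np.sum((col_means - grand) ** 2))
    mse = sse / ((n - 1) * (k - 1))                        # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical CST measurements (micrometres) by three graders on five eyes:
cst = [[310, 312, 309],
       [455, 457, 454],
       [268, 270, 267],
       [390, 391, 389],
       [523, 525, 522]]
print(round(icc2_1(cst), 3))
```

Because between-eye variability in CST dwarfs the small between-grader offsets, the ICC lands very close to 1, mirroring the near-perfect agreement reported above.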
With the increasingly important role of retinal imaging in clinical trials, there has been a growing trend towards centralizing decisions on patient eligibility and disease monitoring in RCs. To ensure a high degree of standardization while keeping bias and variability to a minimum, RCs operate under standards that cover not only technical aspects but also image acquisition, interpretation and documentation. Image gradings are based on clear feature definitions and are performed by certified graders who receive study- and/or disease-specific training. Gradings are often based on a dual reading, in which images are assessed by two independent graders supervised by a third, more experienced grader or retina specialist. In our study, neither additional training nor supervision was conducted. Therefore, the results presented herein do not fully reflect the general reproducibility of OCT grading at an RC but, more importantly, the real-world reproducibility of human expert grading.
While the employment of any of these RC-specific measures in real-world practice might benefit both patient and clinician, their adoption will likely be complicated by cost and time constraints, as well as the professional training and experience of the individual clinician. CATT was an important endeavor that compared the treatment decisions of ophthalmologists with those of an RC.[1] Any macular fluid, as seen on OCT, mandated the administration of anti-VEGF injections. Prior to study initiation, treating ophthalmologists were required to complete an investigator training and pass a knowledge assessment test involving the interpretation of OCTs. Notwithstanding, there were marked discrepancies in the identification of macular fluid in 1737 of 6210 visits (28%), most commonly in visits where the RC detected macular fluid while clinicians did not. This is of significant relevance, considering that in nAMD, ophthalmologists prefer to base their treatment decisions on structural OCT changes rather than visual acuity or FDA labelling (American Society of Retina Specialists, 2020. Global Trends in Retina).
The limited reproducibility seen in our and previous studies raises the question of whether OCT image assessment, as we know it, has reached its maximum potential. Despite the resources available to an RC, manual gradings remain laborious and, to a certain extent, inconsistent and inefficient: OCT imaging holds information generated by millions of pixels per volume, yet an image grading that merely assesses qualitative aspects (e.g., feature presence/absence) or two-dimensional parameters (e.g., CST, feature height) does not capture the wealth of available structural data.
Artificial intelligence (AI)-based algorithms are promising tools that allow a more precise and objective evaluation of the continuously increasing imaging data. Automated detection of retinal fluid is capable of determining not only fluid presence but also subtype, location and volume.[21] An accurate assessment of retinal fluid in exudative diseases is paramount, as increasing fluid volumes in each compartment have been shown to negatively impact BCVA outcomes, independent of the therapeutic substance used.[34] Recording compartmental and volume-based parameters over time will help to identify clinically meaningful thresholds for retinal fluid and standardize treatment decisions between clinicians. This is vital for the individual patient, as both under- and overtreatment should be avoided at all costs. While the translation of findings from clinical trials to the general population is often limited by strict inclusion or exclusion criteria, the application of objective metrics in this setting might mitigate this problem. Most importantly, because the results of automated feature detection can be shared via the cloud in real time, study sites could be freed from the delayed feedback of RCs, thereby expediting patient enrolment and study visits. Real-time AI-based feedback at any level of a randomized clinical trial (e.g., patient screening, monitoring and final data analysis) would substantially save human and financial resources and increase transparency for investigators and sponsors.
While the focus of this study was on exudative changes, the sample size was too small to draw conclusions on less frequently seen OCT changes such as epiretinal membranes, macular holes and macular atrophy. Graders are typically aware of the underlying condition when grading; it is uncertain whether the simultaneous presentation of OCT images from different diseases introduced a grading bias (e.g., a small area of hyporeflectivity in an eye with nAMD might more readily be graded as IRF if presented after a consecutive series of RVO eyes showing obvious IRF).
In conclusion, our systematic evaluation of human expert agreement on OCT biomarkers in nAMD, DME and RVO found that agreement on SRF was rather consistent across all three conditions, whereas there was substantial grading disagreement concerning IRF in nAMD and DME. Importantly, any image assessment by a human, even in the highly standardized setting of an RC, remains laborious and, to a certain degree, subjective. Future efforts should therefore focus on the adoption of automated image analysis tools for a more precise, efficient and objective image assessment. Furthermore, enhanced collaboration between different reading centres in large-scale clinical studies calls for the harmonization and standardization of grading procedures not only within, but also between, centres.