A Global Review of Publicly Available Datasets Containing Fundus Images: Characteristics, Barriers to Access, Usability, and Generalizability

This article provides a comprehensive and up-to-date overview of the repositories that contain color fundus images. We analyzed them regarding availability and legality, presented the datasets’ characteristics, and identified labeled and unlabeled image sets. This study aimed to compile all publicly available color fundus image datasets into a central catalog.


Introduction
Research and healthcare delivery are changing in the digital age. Digital health research and deep-learning-based applications are promising to transform some of the ways we care for our patients and expand access to healthcare in both developed and underprivileged regions of the world [1][2][3]. Automated screening for diabetic retinopathy (DR) is one facet of this transformation, with deep-learning algorithms already supplementing clinical practice in different parts of the world [4,5].
As the barrier to entry for creating deep-learning-based applications has diminished significantly over the last few years, many smaller companies and institutions now attempt to create their own algorithms for healthcare, particularly for image-based analysis [6]. Radiology and ophthalmology are the medical specialties to which deep learning is most applicable due to their reliance on images and visual analysis [7][8][9]. Color fundus photos, retinal and anterior chamber optical coherence tomography (OCT) scans, and visual field analyzer reports all lend themselves to automatic analysis for various possible pathologies [9,10].
Creating these algorithms requires sizable numbers of initial images, both with and without pathology. Data are needed in every step of developing a deep-learning application. In the modern world, with electronic healthcare records, centralized imaging storage, and the pervasiveness of digital solutions and storage, such data are generated worldwide in massive quantities. However, accessing such data is often challenging because of access, cost, and healthcare issues; privacy and data-governance laws; the lack of centralized databases; heterogeneity within particular datasets; missing or insufficient labeling; and the sheer volume of images required.
This article provides an overview of the repositories containing color fundus photos. We analyze them in terms of availability and legality of use. We present the characteristics of datasets and identify labeled and unlabeled sets of images. We also analyze the origin of the datasets. This review aims to compile all publicly available color fundus image datasets into a central catalog and reports:
• The availability of the datasets;
• A breakdown of the legality of using the datasets;
• A classification of the image descriptions;
• Geographical distribution of the datasets that are available by continent and country.

Methods
We used the information presented in the review of publicly available ophthalmological imaging datasets [11]. For each dataset, that review gives details about its accessibility, data access, file types, countries of origin, number of patients examined, total number of images, ocular diseases, types of eye examinations performed, and the device used. We extended the information for all the datasets marked as available in that review. We found 47 such color fundus image repositories.
We then used well-known search tools to find other repositories not described in that review. We searched for color fundus image repositories by entering combinations of the terms "fundus", "retina", and "retinal image" with the words "dataset", "database", and "repositories" into three search engines. The same search was run in the Google search engine and in Google Dataset Search, which is designed for discovering online repositories and supports searching for tabular, graphical, and text datasets. Datasets are indexed there when their publishers describe them using a metadata reference schema. Each search result describes the dataset's contents and provides direct links and the file format. For both searches, we considered the first ten pages of results. We found 17 unique repositories: 5 using the Google search engine and 12 using Google Dataset Search.
The third search engine we chose was Kaggle, a data science and artificial intelligence platform on which users can share their datasets and examine the datasets shared by others. Kaggle datasets are open-sourced, but the license of each dataset must be checked to determine the purposes for which it can be used. The vast majority of Kaggle datasets are reliable; a dataset's reliability can be judged by its upvotes or by reviewing the notebooks that use it. We used the same terms as with the Google search engine and Google Dataset Search and found 61 unique repositories.
We investigated the actual condition of the files, including total size, image sizes, image descriptions, additional artifacts found in images, the legality of using the data in scientific applications, and visualizations of sample data. We did not exclude any color fundus image datasets based on the age, sex, or ethnicity of the patients from whom data were collected. We also included datasets of all languages and geographic origins.
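The term combinations used for the Google and Google Dataset Search queries can be enumerated programmatically. The following is a minimal sketch, assuming that every retina-related term was paired with every dataset-related term; the exact query strings typed by the authors are not given in the text, and the names `RETINA_TERMS`, `DATASET_TERMS`, and `build_queries` are illustrative.

```python
from itertools import product

# Terms quoted in the Methods; pairing each with each is an assumption.
RETINA_TERMS = ["fundus", "retina", "retinal image"]
DATASET_TERMS = ["dataset", "database", "repositories"]


def build_queries(retina_terms, dataset_terms):
    """Return one search string per (retina term, dataset term) pair."""
    return [f"{r} {d}" for r, d in product(retina_terms, dataset_terms)]


queries = build_queries(RETINA_TERMS, DATASET_TERMS)
for query in queries:
    print(query)
```

Under this assumption, three terms of each kind yield nine distinct queries per search engine.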

Dataset Checking Strategy
We noticed that the levels of access to the datasets varied, from fully accessible to available only on request to the authors. Some datasets were unavailable altogether. In this article, we define access levels as follows: (1) Fully open; (2) Available after completing a form; (3) Available after account registration (and possible approval by the authors); (4) Available after sending an email to authors and approval from them; (5) Not available.
After obtaining access, we manually checked each dataset described in Table 1 by downloading it to extract information about file status, sizes, and additional artifacts found in the images. Most datasets were provided as compressed files (ZIP/RAR/7Z), but some were provided as separate files. For datasets delivered as separate files, we determined the size by downloading all files and summing their sizes. We prepared a tool, the Ophthalmic Repository Sample Generator (Section 4.1), to automatically generate this information on repository contents together with image samples. We also manually checked the content of each randomly selected image to see if there were any additional artifacts.
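For datasets delivered as separate files, the size computation amounts to summing the sizes of all downloaded files. The sketch below illustrates that step under the assumption that the files have already been saved into one local directory; it is not taken from the authors' tool, and the function names are hypothetical.

```python
from pathlib import Path


def total_size_bytes(root):
    """Sum the sizes of all regular files under root, recursively."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def human_readable(num_bytes):
    """Format a byte count using binary units (KiB, MiB, GiB, TiB)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"
```

A call such as `human_readable(total_size_bytes("downloads/some_dataset"))` would then report the dataset size; the path here is a placeholder.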

Image Descriptions and Legality of Use
All the datasets we reviewed were described on dedicated web pages or in scientific publications. Based on these sources, we have determined methods of describing images included in these datasets.
We noticed the following types of image descriptions: (1) Manually assigned labels corresponding to diagnosed ocular diseases, image quality, or described areas of interest; (2) Manual annotations on images indicating areas of interest; (3) No descriptions.
We have also extracted information on the legality of data use from websites dedicated to the datasets. We noticed the following approaches to defining the legality of the use of data contained in the datasets: (1) Notifying the authors of the datasets of the results and awaiting permission to publish the results; (2) References to the indicated articles or the dataset in the case of publication of the results; (3) No restrictions.

Data Availability
In the analysis, we used 121 publicly available datasets containing color fundus images [12-127].

Ophthalmic Repository Sample Generator
We developed a generator of pseudo-random samples from publicly available repositories containing color fundus photos. The generator is written in the Python 3 programming language and is available on GitHub (https://github.com/betacord/OphthalmicRepositorySampleGenerator; accessed on 16 May 2023).
The tool facilitates the manual inspection of the contents of repositories. The program takes the URL of a given repository (or its ID on the Kaggle platform) and the sample size (n). The result is a pseudo-random selection of n color fundus photos from the repository and a CSV file containing extracted attributes representing the entire repository. The tool can be easily run on a local computer or in a cloud environment. The general scheme of the generator is shown in Figure 1. As input, the generator takes the sample size, the repository URL, the data output file path, the temporary full data path, the repository sample output path, the repository type, and the output CSV file path.

• The sample size is an integer number representing the size of the random output sample of photos from the repository.
• A repository URL is a string representing a direct URL of the image file; e.g., for the Kaggle dataset, the schema is [username]/[dataset_id]. In the case of a Kaggle competition, it is ID.
• The data output file path is a string representing the output file with the downloaded repository content.
• The temporary full data path is a string representing the temporary path to which the repository will be extracted.
• The repository sample output path is a string representing the path where a randomly selected repository sample will be placed.
• The repository type is an integer representing the type of the repository source: 0 for classic URL, 1 for Kaggle competition, and 2 for Kaggle dataset.
• The output CSV file path is a string representing the path where the CSV file (separated by ;) containing information about the repository will be saved.
These external parameters, which characterize the sample size, data source, temporary paths, source type, and output file paths, must therefore be supplied when running the tool.
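The core sampling step can be illustrated with a short sketch. This is not the code of the published tool: it assumes the repository has already been downloaded and extracted to a local directory, and the function name `sample_repository`, the list of image extensions, and the CSV columns are hypothetical choices made for illustration only.

```python
import csv
import random
import shutil
from pathlib import Path

# Assumed image extensions; the source does not list the ones the tool accepts.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}


def sample_repository(extracted_dir, sample_dir, csv_path, n, seed=None):
    """Copy n pseudo-randomly chosen images and write a ';'-separated summary CSV."""
    images = sorted(
        p for p in Path(extracted_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    chosen = rng.sample(images, min(n, len(images)))

    out_dir = Path(sample_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for index, src in enumerate(chosen):
        # Index prefix avoids collisions between equal file names in different folders.
        shutil.copy2(src, out_dir / f"{index:04d}_{src.name}")

    # Hypothetical repository-level attributes; the real tool's columns may differ.
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter=";")
        writer.writerow(["image_count", "total_bytes", "sample_size"])
        writer.writerow([len(images), sum(p.stat().st_size for p in images), len(chosen)])
    return chosen
```

For example, `sample_repository("tmp/full", "out/sample", "out/info.csv", n=20, seed=42)` would copy 20 images (or fewer, if the repository is smaller) into `out/sample`; all three paths are placeholders. The sample output directory should lie outside the extracted directory so that copied files are not counted as repository content on a later run.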

Results
In Table 1, we included the results of our review. In total, we checked 127 repositories containing color fundus images, of which 120 were currently available, and seven were unavailable due to a non-existent URL. Downloading one dataset was prevented due to a critical server error, and one dataset was delivered as a corrupt zipped file. We have described the characteristics only for the available datasets. We also generated a sample of their content and placed it in the cloud (https://shorturl.at/hmyz3; accessed on 16 May 2023).

Data Access
Out of the 127 datasets checked, we marked 37 as fully open, 6 as available after completing a form, 75 as available after account registration, 2 as available after sending an email to authors and approval from them, and 7 as not available, as can be seen in Figure 2.
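As a quick consistency check, the five access levels defined in the Dataset Checking Strategy and the counts reported here can be tabulated and converted to shares. The snippet below is only an illustrative sketch; the enum and variable names are assumptions, and the percentages are derived from the reported counts rather than quoted from the source.

```python
from enum import IntEnum


class AccessLevel(IntEnum):
    """Access levels (1)-(5) as defined in the Dataset Checking Strategy."""
    FULLY_OPEN = 1
    FORM = 2
    REGISTRATION = 3
    EMAIL_APPROVAL = 4
    NOT_AVAILABLE = 5


# Counts as reported in the Data Access paragraph.
ACCESS_COUNTS = {
    AccessLevel.FULLY_OPEN: 37,
    AccessLevel.FORM: 6,
    AccessLevel.REGISTRATION: 75,
    AccessLevel.EMAIL_APPROVAL: 2,
    AccessLevel.NOT_AVAILABLE: 7,
}

total = sum(ACCESS_COUNTS.values())
shares = {level: 100 * count / total for level, count in ACCESS_COUNTS.items()}

for level, share in shares.items():
    print(f"{level.name:<15} {ACCESS_COUNTS[level]:>3}  {share:5.1f}%")
print(f"{'TOTAL':<15} {total:>3}")
```

The five counts sum to 127, which matches the number of repositories checked, and account registration is by far the most common access requirement.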

Characteristics of Datasets
Almost all (122 out of 124) of the datasets could be downloaded as zipped files, and only 2 had to be downloaded as separate files. Three of the zipped files had a problem with their file extension. In 59 datasets, all images had the same dimensions (in pixels), whereas the other 68 contained images of differing dimensions. Images from nine different datasets contained additional artifacts such as dates, numbers, digits, color scales, markers, icons, arteries, vessels, veins, and key points marked on the photographs.

The Legality of Use and Image Descriptions
Out of the 127 datasets, the authors of 3 provided a note about the need to inform them about the obtained results. Authors of 44 datasets stated that the indicated works must be cited when using the provided data and publishing results. Nearly two-thirds (80 datasets, 63%) had no restrictions on use. Figure 3 shows the full breakdown of the legality of using data contained in the datasets.
Eighty-nine datasets had labels assigned to the images. Thirty datasets had areas of interest annotated on the images. Sixteen datasets did not have any descriptions. A full breakdown of the image descriptions is shown in Figure 4.
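The three description counts sum to 135, which is more than the 127 datasets checked, so some datasets presumably carry more than one type of description. The sketch below works through that arithmetic. It also shows that the percentages quoted in the Discussion (66%, 22%, and 12%) match shares of the 135 description assignments rather than of the 127 datasets; this reading is an inference from the numbers, not something the text states, and the variable names are illustrative.

```python
# Counts as reported in the paragraph on image descriptions.
DESCRIPTION_COUNTS = {
    "manual labels": 89,
    "annotated areas of interest": 30,
    "no description": 16,
}
DATASETS_CHECKED = 127

assignments = sum(DESCRIPTION_COUNTS.values())
surplus = assignments - DATASETS_CHECKED  # assignments beyond one per dataset

for name, count in DESCRIPTION_COUNTS.items():
    share_of_assignments = round(100 * count / assignments)
    share_of_datasets = round(100 * count / DATASETS_CHECKED)
    print(f"{name}: {count} ({share_of_assignments}% of assignments, "
          f"{share_of_datasets}% of datasets)")
print(f"assignments: {assignments}, surplus over datasets: {surplus}")
```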

Discussion
Publicly available datasets remain important in digital health research and innovation in ophthalmology. Although, on the whole, the number of publicly available color fundus image datasets is growing, it is an ongoing process with new datasets arriving and older datasets becoming inaccessible. The lack of a central repository or listing for ophthalmic datasets, coupled with the low discoverability of many of the datasets, constitutes a significant barrier to access to high-quality representative data suitable for a given purpose. This article provides an up-to-date review and discussion of available color fundus image datasets.
Health data poverty, in this case the scarcity of color fundus image datasets originating from underprivileged regions, particularly Africa, is a cause for concern. Although the relationship between patients' ethnicity, background, and other attributes and fundus features is not clearly documented, the lack of representative datasets might lead to ethnic or geographical bias and poor generalizability in deep-learning applications. The relative lack of datasets might mean that underrepresented regions miss out on the benefits of future data-driven screening and healthcare solutions.
The study published by Khan and colleagues was the first comprehensive and systematic listing of public ophthalmological imaging databases. In their work, out of 121 datasets identified through various search strategies, only 94 were deemed truly available, with 27 databases being inaccessible even after multiple attempts spaced weeks apart [11]. Therefore, roughly one-fifth of the datasets were inaccessible at the time of their review. This is in line with our findings in writing this update. Out of 94 datasets marked available by the authors, only 74 were available at the time of preparing this article, just over 1 year after the initial paper publication of Khan and colleagues and just 15 months after the first online publication. Therefore, access to over one-fifth (21%) of datasets was lost in fewer than 2 years, similar to the 22% found inaccessible in the original review. Almost all of the datasets that became unavailable since the publication of Khan's study were offline: the dedicated websites are unreachable, with two unavailable due to errors.
Of the 127 color fundus image repositories initially identified for this analysis, 7 became unavailable during the identification, verification, and review of the newly discovered datasets. Although the typical period between a dataset's publication and its becoming inaccessible has yet to be established, it is clear that datasets going offline or otherwise becoming unreachable is an ongoing process, and information on availability can quickly become obsolete. It is important to note that while Khan et al. published a list of datasets containing OCT and other imaging modalities, this review focuses specifically on datasets of color fundus images [11].
Given adequate citation of sources, all but three of the datasets allowed unrestricted access and publishing of results for scientific, non-commercial purposes. The three exceptions required approval from dataset authors before publishing any results, which may limit their scientific usability, leaving potential publication opportunities to the whim of the original dataset authors. More than half of the datasets do not impose any restrictions and do not explicitly require citations, though citing sources is one of the fundamental ethical principles of scientific work.
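The two attrition figures discussed here follow directly from the reported counts. The short sketch below reproduces the arithmetic; the variable names are illustrative and the figures are those quoted from the review by Khan and colleagues [11] and from this article.

```python
def share(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100 * part / whole


identified = 121       # datasets identified by Khan and colleagues
available_then = 94    # accessible at the time of their review
available_now = 74     # of those, still accessible when this article was prepared

inaccessible_then = identified - available_then  # 27 datasets
lost_since = available_then - available_now      # 20 datasets

print(f"inaccessible in original review: {share(inaccessible_then, identified):.1f}%")
print(f"lost since original review:      {share(lost_since, available_then):.1f}%")
```

Both shares round to the values quoted in the text, about 22% and 21%, respectively.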
Most datasets contain additional information about individual images. More than half of the datasets (66%) contained manually assigned text-based labels corresponding to diagnosed ocular diseases, image quality, or described areas of interest. One-fifth (22%) of datasets contained annotations indicating areas of interest or pathology. Only 12% of datasets contained raw images without metadata for individual images.
Figures 5 and 6 explore the geographical origins of the datasets. Almost half of the datasets for which a region of origin could be established originated from Asia, with Europe making up another one-third. Overall, out of 73 datasets, 24 originated from outside of Europe or Asia, with none of the datasets originating from Africa and a nearly equal split between North and South America. The distribution of datasets available from individual countries is shown in Figure 6. Although dataset origin relates to the location of the person or organization sharing the dataset and does not necessarily represent the origin of patients' images, the complete lack of images from Africa is concerning.
Africa, particularly sub-Saharan Africa, is a vastly underserved region with one of the lowest numbers of ophthalmologists per population globally [167]. There are, on average, three ophthalmologists per million population in sub-Saharan Africa, compared to about 80 in developed countries [167]. Although this is likely one of the reasons for the lack of available datasets from the region, it is also the rationale for the need for datasets from this region. Digital healthcare solutions, including deep-learning software, may help alleviate some of the healthcare disparities in the region. However, these require development or at least validation on the target population to avoid any potential for racial or other population-specific trait bias [168].
This holds true for color fundus images, where, other than background fundus pigmentation, the influence of patients' attributes such as age, sex, or race on fundus features and their variations is not well known. This question is currently being tackled using deep-learning methods [169]. The suspicion of poor generalizability in populations outside of the ethnic or geographical scope of the initial training image data and, subsequently, the need for development and validation on multi-ethnic populations are not new concepts in the automated analysis of color fundus images [11,149,170,171]. Serener et al. have shown that the performance of deep-learning algorithms for detecting diabetic retinopathy in color fundus images varies based on geographical or ethnic traits of the training and validation populations [171].

Conclusions
Open datasets are still crucial for digital health research and innovation in ophthalmology. Although the number of publicly available color fundus image datasets is growing, new datasets keep arriving while older ones become inaccessible. There are only a few places that store or list ophthalmic datasets, and many datasets are hard to find, which makes it difficult to obtain high-quality, representative data suitable for a given purpose. This paper reviewed and discussed the color fundus image datasets that are currently available.
Data Availability Statement: No new data were created or analyzed in this study. Data sharing is not applicable to this article.