The NCI Imaging Data Commons as a platform for reproducible research in computational pathology

Background and Objectives: Reproducibility is a major challenge in developing machine learning (ML)-based solutions in computational pathology (CompPath). The NCI Imaging Data Commons (IDC) provides >120 cancer image collections according to the FAIR principles and is designed to be used with cloud ML services. Here, we explore its potential to facilitate reproducibility in CompPath research. Methods: Using the IDC, we implemented two experiments in which a representative ML-based method for classifying lung tumor tissue was trained and/or evaluated on different datasets. To assess reproducibility, the experiments were run multiple times with separate but identically configured instances of common ML services. Results: The AUC values of different runs of the same experiment were generally consistent. However, we observed small variations in AUC values of up to 0.045, indicating a practical limit to reproducibility. Conclusions: We conclude that the IDC facilitates approaching the reproducibility limit of CompPath research (i) by enabling researchers to reuse exactly the same datasets and (ii) by integrating with cloud ML services so that experiments can be run in identically configured computing environments.


Introduction
Computational pathology (CompPath) is a new discipline that investigates the use of computational methods for the interpretation of heterogeneous data in clinical and anatomical pathology to improve health care in pathology practice. A major focus area of CompPath is the computerized analysis of digital tissue images [1]. These images show thin sections of surgical specimens or biopsies that are stained to highlight relevant tissue structures. To cope with the high level of complexity and variability of tissue images, virtually all state-of-the-art methods use sophisticated machine learning (ML) algorithms such as Convolutional Neural Networks (CNN) [2].
Because CompPath is applicable in a wide variety of use cases, there has been an explosion of research on ML-based tissue analysis methods [3,4]. Many methods are intended to assist pathologists in routine diagnostic tasks such as the recognition of tissue patterns for disease classification [5][6][7][8][9]. Beyond that, CompPath methods have also shown promise for deriving novel biomarkers from tissue patterns that can predict outcome, genetic mutations, or therapy response [3].

Reproducibility challenges
In recent years, it has become increasingly clear that reproducing the results of published ML studies is challenging [10][11][12][13]. Reproducibility is commonly defined as the ability to obtain "consistent results using the same input data, computational steps, methods, and conditions of analysis" [14]. Difficulties related to reproducibility prevent other researchers from verifying and reusing published results and are a critical barrier to translating solutions into clinical practice [15]. In most cases, reproducibility problems seem to stem not from a lack of scientific rigor, but from the challenge of conveying all details and the set-up of complex ML methods [12,15,16]. In the following, we provide an overview of the main challenges related to ML reproducibility and the existing approaches to address them.
The first challenge is the specification of the analysis method itself. ML algorithms have many variables, such as the network architecture, hyperparameters, and performance metrics [16][17][18]. ML workflows usually consist of multiple processing steps, e.g., data selection, preprocessing, training, and evaluation [18]. Small variations in these implementation details can have significant effects on performance. To make all these details transparent, it is crucial to publish the underlying source code [15]. Workflows should be automated as much as possible to avoid errors when performing steps manually. Jupyter notebooks have emerged as the de facto standard to implement and communicate ML workflows [19]. By combining software code, intermediate results, and explanatory texts into "computational narratives" [20] that can be interactively run and validated, notebooks make it easier for researchers to reproduce and understand the work of others [19].
The second challenge to reproducibility is the specification and setup of the computing environment. ML workflows require significant computational resources including, e.g., graphics or tensor processing units (GPUs or TPUs). In addition, they often have many dependencies on specific software versions. Minor variations in the computing environment can significantly affect the results [13]. Setting up a consistent computational environment can be very expensive and time consuming. This challenge can be partially solved by embedding ML workflows in virtual machines or software containers like Docker [21]. Both include all required software dependencies so that ML workflows can be shared and run without additional installation effort. Cloud ML services, like Google Vertex AI, Amazon SageMaker, or Microsoft Azure Machine Learning, provide an even more comprehensive solution. By offering preconfigured computing environments for ML research in combination with the required high-performance hardware, such services can further reduce the setup effort and enable the reproduction of computationally intensive ML workflows even if one does not own the required hardware. They also typically provide web-based graphical user interfaces through which Jupyter notebooks can be run and shared directly in the cloud, making it easy for others to reproduce, verify, and reuse ML workflows [21].
The third challenge related to ML reproducibility is the specification of data and its accessibility. The performance of ML methods depends heavily on the composition of their training, validation, and test sets [13,22]. For current ML studies, it is rarely possible to reproduce this composition exactly, as studies are commonly based on specific, hand-curated datasets which are only roughly described rather than explicitly defined [17,23]. Also, the datasets are often not made publicly available [15], or the criteria/identifiers used to select subsets from publicly available datasets are missing. Stakeholders from academia and industry have defined the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles [24], a set of requirements to facilitate discovery and reuse of data. FAIR data provision is now considered a "must" to make ML studies reproducible, and the FAIR principles are adopted by more and more public data infrastructure initiatives and scientific journals [25].
Reproducing CompPath studies is particularly challenging. To reveal fine cellular details, tissue sections are imaged at microscopic resolution, resulting in gigapixel whole-slide images (WSI) [26]. Due to the complexity and variability of tissue images [27], it takes many example WSI, often thousands, to develop and test reliable ML models. Processing and managing such large amounts of data requires extensive computing power, storage resources, and network bandwidth. Reproduction of CompPath studies is further complicated by the large number of proprietary and incompatible WSI file formats that often impede data access and make it difficult to combine heterogeneous data from different studies or sites. The Digital Imaging and Communications in Medicine (DICOM) standard [28] is an internationally accepted standard for storage and communication of medical images. It is universally used in radiology and other medical disciplines, and has great potential to become the uniform standard for pathology images as well [29]. However, until now, there have been few pathology data collections provided in DICOM format.

NCI Imaging Data Commons
The National Cancer Institute (NCI) Imaging Data Commons (IDC) is a new cloud-based repository within the US national Cancer Research Data Commons (CRDC) [30]. A central goal of the IDC is to improve the reproducibility of data-driven cancer imaging research. For this purpose, the IDC provides large public cancer image collections according to the FAIR principles.
Besides pathology images (brightfield and fluorescence) and their metadata, the IDC includes radiology images (e.g., CT, MR, and PET) together with associated image analysis results, image annotations, and clinical data providing context about the images. At the time of writing this article, the IDC contained 128 data collections with more than 63,000 cases and more than 38,000 WSI from different projects and sites. The collections cover common tumor types, including carcinomas of the breast, colon, kidney, lung, and prostate, as well as rarer cancers such as sarcomas or lymphomas. Most of the WSI collections originate from The Cancer Genome Atlas (TCGA) [31] and Clinical Proteomic Tumor Analysis Consortium (CPTAC) [32] projects and were curated by The Cancer Imaging Archive (TCIA) [33]. These collections are commonly used in the development of CompPath methods [7,[34][35][36].
The IDC implements the FAIR principles as follows: Interoperability: While the original WSIs were provided in proprietary, vendor-specific formats, the IDC harmonized the data and converted them into the open, standard DICOM format [29]. DICOM defines data models and services for storage and communication of medical image data and metadata, as well as attributes for different real-world entities (e.g., patient, study) and controlled terminologies for their values. In DICOM, a WSI corresponds to a "series" of DICOM image objects that represent the digital slide at different resolutions. Image metadata are stored as attributes directly within the DICOM objects (see the reading sketch after this overview).
Accessibility: The IDC is implemented on the Google Cloud Platform (GCP), enabling cohort selection and analysis directly in the cloud. Since IDC data are provided as part of the Google Public Datasets Program, they can be freely accessed from cloud or local computing environments. In the IDC, DICOM objects are stored as individual DICOM files in Google Cloud Storage (GCS) buckets and can be retrieved using open, free, and universally implementable tools.
Findability: Each DICOM file in the IDC has a persistent universally unique identifier (UUID) [37]. DICOM files in storage buckets are referenced through GCS URLs, consisting of the bucket URL and the UUID of the file. Images in the IDC are described with rich metadata, including patient (e.g., age, sex), disease (e.g., subtype, stage), study (e.g., therapy, outcome), and imaging-related data (e.g., specimen handling, scanning). All DICOM and non-DICOM metadata are indexed in a BigQuery database [38] that can be queried programmatically using standard Structured Query Language (SQL) statements (see section "IDC data access"), allowing for an exact and persistent definition of cohorts for subsequent analysis.
Reusability: All image collections are associated with detailed provenance information but stripped of patient-identifiable information. Most collections are released under data usage licenses that allow unrestricted use in research studies.
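To illustrate the interoperability aspect above, the following is a minimal sketch of reading standard attributes from an IDC slide microscopy object; it assumes the pydicom library and a previously downloaded DICOM file whose name is purely illustrative.

```python
# A minimal sketch, assuming pydicom is installed and a DICOM slide
# microscopy file has been downloaded from the IDC (file name illustrative).
import pydicom

ds = pydicom.dcmread("slide_level0.dcm")

# Standard DICOM attributes are stored directly within the object.
print(ds.Modality)                 # "SM" for slide microscopy
print(ds.TotalPixelMatrixColumns)  # width of this resolution level in pixels
print(ds.TotalPixelMatrixRows)     # height of this resolution level in pixels
```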

Objective
This paper explores how the IDC and cloud ML services can be used in combination for CompPath studies and how this can facilitate reproducibility. It is also intended as an introduction to how the IDC can be used for reproducible CompPath research. Therefore, important aspects such as data access are described in more detail in the Methods section.

Overview
We implemented two CompPath experiments using data collections from the IDC and common ML services (Figure 1). Since the computing environments provided by cloud ML services are all virtualized, two identically configured instances may run on different host hardware and software (e.g., system software versions, compiler settings) [13]. To investigate if and how this affects reproducibility, both experiments were executed multiple times, each in a new instance of the respective ML service.
Experiment 1 replays the entire development process of the method, including model training and validation. Experiment 2 performs inference with a trained model on independent data. The model trained in Experiment 1 was used as the basis for Experiment 2. The two experiments were conducted with different collections in the IDC: TCGA-LUAD/LUSC [39,40] and CPTAC-LUAD/LSCC [41,42], respectively. While both the TCGA and the CPTAC collections cover H&E-stained lung tissue sections of the three classes considered (Figure 2), they were created by different clinical institutions using different slide preparation techniques.

Implementation
Both experiments were implemented as standalone Jupyter notebooks that are available as open source [43]. To enable reproducibility, care was taken to make operations deterministic, e.g., by seeding pseudo-random operations, fixing initial weights for network training, and by iterating over unordered container types in a defined order. Utility functionality was designed as generic classes and functions that can be reused for similar use cases.
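For illustration, the following is a minimal sketch of such determinism measures; it assumes a TensorFlow-based workflow, and the seed value is arbitrary rather than the one used in our notebooks.

```python
# A minimal sketch of determinism measures, assuming TensorFlow 2.x.
import os
import random

import numpy as np
import tensorflow as tf

os.environ["TF_DETERMINISTIC_OPS"] = "1"  # request deterministic GPU kernels
random.seed(42)         # Python's built-in RNG
np.random.seed(42)      # NumPy RNG
tf.random.set_seed(42)  # TensorFlow global RNG

# Iterate over unordered container types in a defined order:
slide_ids = {"slide_b", "slide_a", "slide_c"}
for slide_id in sorted(slide_ids):  # sorted() fixes the iteration order
    ...
```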
As the analysis method itself is not the focus of this paper, we adopted the algorithmic steps and evaluation design of a lung tumor classification method described in a widely cited study by Coudray et al. [7]. The method was chosen because it is representative of common CompPath tasks and easy to understand. Our implementation processed images at a lower resolution, which is significantly less computationally expensive.
In our analysis workflow, a WSI was subdivided into non-overlapping rectangular tiles, each measuring 256 × 256 pixels at a resolution of 1 µm/px. Tiles containing less than 50% tissue, as determined by pixel value statistics, were discarded. Each tile was assigned class probabilities by performing multi-class classification using an InceptionV3 CNN [44]. The per-tile results were finally aggregated into a single classification of the entire slide. The workflow is visualized in Figure 3 and a detailed description is provided in the respective notebooks.
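As an illustration of the tissue criterion, the sketch below shows one possible implementation based on per-pixel brightness; the threshold value is an assumption for demonstration purposes, not necessarily the statistic used in our notebooks.

```python
# A minimal sketch of a tissue filter, assuming tiles are RGB NumPy arrays;
# the brightness threshold of 220 is an illustrative assumption.
import numpy as np

def tissue_fraction(tile_rgb: np.ndarray, brightness_threshold: int = 220) -> float:
    """Fraction of pixels darker than the threshold (background is near-white)."""
    gray = tile_rgb.mean(axis=-1)  # simple per-pixel brightness
    return float((gray < brightness_threshold).mean())

def keep_tile(tile_rgb: np.ndarray) -> bool:
    """Keep only tiles that contain at least 50% tissue."""
    return tissue_fraction(tile_rgb) >= 0.5
```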
In Experiment 1, the considered slides were divided into training, validation, and test sets with proportions of 70%, 15%, and 15%, respectively. To keep the sets independent and avoid overoptimistic performance estimates [45], we ensured that slides from a given patient were assigned to only one set, which resulted in 705, 151, and 153 patients per subset. The data collections used did not contain annotations of tumor regions, but only one reference class value per WSI. Following the procedure used by Coudray et al., all tiles were considered to belong to the reference class of their respective slide. Training was performed using a categorical cross-entropy loss between the true class labels and the predicted class probabilities, and the RMSProp optimizer with minimal adjustments to the default hyperparameter values [46]. The epoch with the highest area under the receiver operating characteristic (ROC) curve (AUC) on the validation set was chosen for the final model.
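A minimal sketch of such a patient-level split is given below; it assumes the slides are available as (patient_id, slide_id) pairs, and the helper name, seed, and split logic are illustrative rather than taken from our notebooks.

```python
# A minimal sketch of a patient-level split; names and seed are illustrative.
import random

def split_by_patient(slides, seed=42, fractions=(0.70, 0.15, 0.15)):
    """Split (patient_id, slide_id) pairs so each patient falls into one set."""
    patients = sorted({patient_id for patient_id, _ in slides})
    random.Random(seed).shuffle(patients)  # seeded shuffle for reproducibility
    n_train = int(fractions[0] * len(patients))
    n_val = int(fractions[1] * len(patients))
    subsets = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }
    # Every slide of a patient ends up in exactly one subset.
    return {
        name: [slide for patient, slide in slides if patient in members]
        for name, members in subsets.items()
    }
```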

IDC data access
For most CompPath studies, one of the first steps is to select relevant slides using appropriate metadata. In the original data collections, parts of the metadata were stored in the image files and other parts in separate files of different formats (e.g., CSV or JSON files). To select relevant slides, the images and metadata first had to be downloaded in their entirety, and then the metadata had to be processed using custom tools. With the IDC, data selection can be done by filtering a rich set of DICOM attributes with standard BigQuery SQL statements (Figure 4). The results are tables in which rows represent DICOM files and columns represent selected metadata attributes. As this facilitates the accurate and reproducible definition of the data subsets used in the analysis, these statements are described in more detail below.
An SQL query for selecting WSI in the IDC generally consists of at least a SELECT, a FROM, and a WHERE clause. The SELECT clause specifies the metadata attributes to be returned. The IDC provides a wealth of metadata attributes, including image-, patient-, disease-, and study-level properties. The attribute "gcs_url" is usually selected because it stores the GCS URL needed to access the DICOM file. The FROM clause refers to a central table "dicom_all", which summarizes all DICOM attributes of all DICOM files. This table can be joined with other tables containing additional project-specific metadata. Crucial to reproducibility is that all IDC data are versioned: Each new release of the IDC is represented as a new BigQuery dataset, keeping the metadata for the previous release and the corresponding DICOM files accessible even if they are modified in the new release. The version to use is specified via the dataset specifier in fully qualified table names. All experiments in this manuscript were conducted against IDC data version 11, i.e., the BigQuery table "bigquery-public-data.idc_v11.dicom_all". The WHERE clause defines which DICOM files are returned by imposing constraints on certain metadata attributes. To guarantee reproducibility, it is essential to avoid non-deterministic SQL constructs (e.g., ANY_VALUE) and to conclude the statement with an ORDER BY clause, which ensures that results are returned in a sorted order.
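As a minimal sketch of this query pattern, the following could be run with the google-cloud-bigquery client library in an authenticated GCP project; the selected attributes and collection identifiers are illustrative and not the exact statement used in our notebooks.

```python
# A minimal sketch, assuming the google-cloud-bigquery library and an
# authenticated Google Cloud project; attribute values are illustrative.
from google.cloud import bigquery

client = bigquery.Client()  # uses the active project and credentials

query = """
SELECT gcs_url, PatientID, collection_id
FROM `bigquery-public-data.idc_v11.dicom_all`
WHERE Modality = 'SM'
  AND collection_id IN ('tcga_luad', 'tcga_lusc')
ORDER BY gcs_url  -- sorted output keeps the result set reproducible
"""
slides = client.query(query).to_dataframe()
print(len(slides), "DICOM files selected")
```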
The two experiments considered in this paper also begin with the execution of a BigQuery SQL statement to select appropriate slides and required metadata from the IDC. A detailed description of the statements is given in the respective notebooks. Experiment 1 queries specific H&E-stained tissue slides from the TCGA-LUAD/LUSC collections, resulting in 2163 slides (591 normal, 819 LUAD, 753 LSCC). Experiment 2 uses a very similar statement to query the slides from the CPTAC-LUAD/LSCC collections, resulting in 2086 slides (743 normal, 681 LUAD, 662 LSCC).
Once their GCS URLs are known, the selected DICOM files in the IDC can be accessed efficiently using the open source tool "gsutil" [47] or any other tool that supports the Simple Storage Service (S3) API. During training in Experiment 1, image tiles of different WSI had to be accessed repeatedly in random order. To speed up this process, all considered slides were preprocessed and the resulting tiles were extracted from the DICOM files and cached as individual PNG files on disk before training. In contrast, simply applying the ML method in Experiment 2 required only a single pass over the tiles of each WSI in sequential order. Therefore, it was feasible to access the respective DICOM files and iterate over individual tiles at the time they were needed for the application of the ML method.
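For illustration, a minimal sketch of retrieving one selected file via gsutil is shown below; the bucket path is a placeholder, as real URLs are taken from the "gcs_url" column of the query result.

```python
# A minimal sketch, assuming gsutil is installed and authenticated.
import subprocess

def download_dicom(gcs_url: str, target_dir: str = ".") -> None:
    """Copy a DICOM file from its storage bucket to local disk via gsutil."""
    subprocess.run(["gsutil", "cp", gcs_url, target_dir], check=True)

# The URL below is a placeholder; real URLs come from the "gcs_url" column.
download_dicom("gs://<idc-bucket>/<file-uuid>.dcm")
```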

Cloud ML services
The two experiments were conducted with two different cloud ML services of the GCP: Vertex AI and Google Colaboratory. Both services offer virtual machines (VMs) preconfigured with common ML libraries and a JupyterLab-like interface that allows editing and running notebooks from the browser. They are both backed with extensive computing resources including state-of-the-art GPUs or TPUs. The costs of both services scale with the type and duration of use for the utilized compute and storage resources. To use either of them with the IDC, a custom Google Cloud project must be in place for secure authentication and billing, if applicable. Since training an ML model is much more computationally intensive than performing inference, we conducted Experiment 1 with Vertex AI and Experiment 2 with Google Colaboratory. Vertex AI can be attached to efficient disks for storage of large amounts of input and output data, making it more suitable for memory-intensive and long-running experiments. Colaboratory, on the other hand, offers several less expensive payment plans, with limitations in the provided computing resources and guaranteed continuous usage times. Colaboratory can even be used completely free of charge, with a significantly limited guaranteed GPU usage time (12 hours at the time of writing). This makes Colaboratory better suited for smaller experiments or exploratory research.

Evaluation
Experiment 1 was performed using a common Vertex AI VM configuration (8 vCPU, 30 GB memory, NVIDIA T4 GPU, TensorFlow Enterprise 2.8 distribution). Experiment 2 was performed with Colaboratory runtimes (2-8 vCPU, 12-30 GB memory). When using Google Colaboratory for Experiment 2, we were able to choose between different GPU types, including NVIDIA T4 and NVIDIA P100 GPUs. Since it has been suggested that the particular type of GPU can affect results [48], all runs of Experiment 2 were repeated on both GPU types. Runs with NVIDIA T4 were performed with the free version of Colaboratory, while runs with NVIDIA P100 were performed in combination with a paid GCE Marketplace VM, which was necessary for guaranteed use of this GPU.
For each run of an experiment, classification accuracy was assessed in terms of class-specific, one-vs-rest AUC values based on the slide-level results. In addition, 95% confidence intervals of the AUC values were computed by 1000-fold bootstrapping over the slide-level results.
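A minimal sketch of this evaluation is shown below, assuming slide-level binary labels and predicted scores for one class as NumPy arrays; the helper name and seed are illustrative.

```python
# A minimal sketch of AUC with a bootstrap CI, assuming scikit-learn.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_with_bootstrap_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=42):
    """One-vs-rest AUC with a bootstrap confidence interval over slides."""
    rng = np.random.default_rng(seed)
    aucs = []
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample slides with replacement
        if len(np.unique(y_true[idx])) < 2:  # skip single-class resamples
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lower, upper)
```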
To speed up Experiment 2, only a random subset of 300 of the selected slides (100 normal, 100 LUAD, 100 LSCC) was considered in the analysis, which was approximately the size of the test set in Experiment 1.

Results
The evaluation results of both experiments are summarized in Table 1. It became apparent that none of the experiments was perfectly reproducible and there were notable deviations in the results of repeated runs. In Experiment 1, AUC values differed by up to 0.045 between runs. In Experiment 2, there were also minimal deviations in the AUC values of the different runs, but none of these were greater than 0.001. These deviations occurred regardless of whether the runs were executed on the same GPU type or not.
The classification accuracy of the method trained in Experiment 1 appears satisfactory when evaluated on the TCGA test set and comparable to the results of a similar study based on the same TCGA collections [7]. When applied to the CPTAC test set in Experiment 2, the same model performed substantially worse (Figure 5).
Experiment 1 took an order of magnitude longer to complete (mean runtime of 1 d 18 h ± 1 h) than Experiment 2 (mean runtime of 1 h 54 min ± 23 min with NVIDIA T4 and mean runtime of 1 h 28 min ± 8 min with NVIDIA P100). The ML service usage charges for Experiment 1 were approximately US$ 32 per run. With the free version of Colaboratory, Experiment 2 was performed at no cost, while runs with the GCE Marketplace VM cost approximately US$ 2 per run.

Discussion
The aim of this study was to investigate how CompPath studies can be made reproducible through the use of cloud-based computing environments and the IDC as the source of input data. Although the same code was run with the same data using the same ML services and care was taken that operations were deterministic (see section "Implementation"), we observed small deviations in the results of repeated runs. We did not investigate whether the deviations originate from differences in the hardware and software used by the hosts of the virtual computing environments, or whether they are due to randomness resulting from parallel processing [13]. The greater variability in the results of Experiment 1 can possibly be explained by its higher computational complexity. Although the observed deviations appear negligible for many applications, they represent a practical upper limit for reproducibility. Such issues are likely to occur in any computing environment. As outlined below, we argue that the IDC can help to approach this reproducibility limit.
We chose Jupyter notebooks and cloud ML services to address the first two reproducibility challenges mentioned in the Introduction: specifying the analysis method and setting up the computing environment. With the IDC, we were able to tackle the third reproducibility challenge with respect to the special requirements of CompPath: specifying and accessing the data.
By providing imaging data collections according to the FAIR principles, the IDC facilitates precise definition of the datasets used in the analysis and ensures that the exact same data can be reused in follow-up studies. Since metadata on acquisition and processing can be included as DICOM attributes alongside the pixel data, the risk of data confusion can be greatly reduced. The IDC also facilitated the use of cloud ML services because it makes terabytes of WSI data efficiently accessible by on-demand compute resources. We consider our experiments to be representative of common CompPath applications. Therefore, the IDC should be similarly usable for other CompPath studies.
The results of Experiment 2 also indicate how well the model trained in Experiment 1 transfers to independent data. Although the majority of slides were correctly classified, AUC values were significantly lower, indicating that the model is only transferable to a limited extent and additional training is needed. Since all IDC data collections (both the image pixel data and the associated metadata) are harmonized into a standardized DICOM representation, testing transferability on a different dataset required only minor adjustments to our BigQuery SQL statement. In the same way, the IDC makes it straightforward to use multiple datasets in one experiment or to transfer an experimental design to other applications.

Limitations
Using cloud ML services comes with certain trade-offs. Conducting computationally intensive experiments requires setting up a payment account and paying a fee based on the type and duration of the computing resources used. Furthermore, although the ML services are widely used and likely to be supported for at least the next few years, there is no guarantee that they will be supported in the long term and support the specific configuration of the computing environment used (e.g., software version, libraries). Those who do not want to make these compromises can also access IDC data collections without using ML services, both in the cloud and on-premises. Even if this means losing the previously mentioned advantages with regard to the first two reproducibility challenges, the IDC can still help to specify the data used in a clear and reproducible manner.
Independent of the implementation, a major obstacle to the reproducibility of CompPath methods remains their high computational cost. A full training run often takes several days, making reproduction by other scientists tedious. Performing model inference is generally faster and less resource intensive when compared to model training. Therefore, Experiment 2 runs well even with the free version of Google Colaboratory, enabling others to reproduce it without spending money. The notebook also provides a demo mode, which completes in a few minutes, so anyone can easily experiment with applying the inference workflow to arbitrary images from the IDC.
At the moment, the IDC exclusively hosts public data collections. New data must undergo rigorous curation to de-identify images (done by TCIA or the data submitter) and to harmonize them into a standard representation (done by the IDC), which can require significant effort. Therefore, only data collections that are of general relevance and high quality are included in the IDC. As a result, the data in the IDC were usually acquired for purposes other than a particular CompPath application and cannot be guaranteed to be representative and free of bias [49]. Compiling truly representative CompPath datasets is very challenging [45]. Nevertheless, the data collections in the IDC can provide a reasonable basis for exploring and prototyping CompPath methods.

Outlook
The IDC is under continuous development and its technical basis is constantly being refined, e.g., to support new data types or to facilitate data selection and access. Currently, DICOM files in the IDC can only be accessed as a whole from their respective storage buckets. This introduces unnecessary overhead when only certain regions of a slide need to be processed, and it may make it necessary to temporarily cache slides to efficiently access multiple image regions (see section "IDC data access"). Future work should therefore aim to provide efficient random access to individual regions within a WSI. For maximum portability, such access should ideally be possible via standard DICOM network protocols such as DICOMweb [29,50].
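As a hypothetical sketch of such region-level access, the following shows how individual frames (tiles) of a WSI instance could be retrieved over DICOMweb using the dicomweb-client library; the server URL and UIDs are placeholders for a server that supports this protocol, which the IDC does not currently provide.

```python
# A hypothetical sketch, assuming the dicomweb-client library and a
# DICOMweb (WADO-RS) server; URL and UIDs below are placeholders.
from dicomweb_client.api import DICOMwebClient

client = DICOMwebClient(url="https://example.org/dicomweb")
frames = client.retrieve_instance_frames(
    study_instance_uid="<study-uid>",
    series_instance_uid="<series-uid>",
    sop_instance_uid="<instance-uid>",
    frame_numbers=[1],  # one tile of a tiled WSI pyramid level
)
```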
The IDC is continuously being expanded to support even more diverse CompPath applications. For instance, images collected by the Human Tumor Atlas Network (HTAN) that provide rich, multispectral information on subcellular processes [51] have recently been added. The IDC is integrated with other components of the CRDC, such as the Genomic Data Commons [52] or the Proteomic Data Commons [53]. This opens up many more potential CompPath applications involving tissue images and different types of molecular cancer data [54].

Conclusion
We demonstrated how the IDC can facilitate the reproducibility of CompPath studies. Implementing future studies in a similar way can help other researchers and peer reviewers to understand, validate, and advance the analysis approach.

Author Contributions
DPS and AH conceived and carried out the study. AH and AF supervised the project. AF, MDH, DAC, HH, WC, WJRL, SP and RK supported the study in different ways, e.g., by providing data, supporting set-up of the computing infrastructure, interpreting the results, and giving general advice. AH and DPS drafted the manuscript. All authors critically revised the manuscript and expressed their consent to the final version.

Declaration of Competing Interest
The authors declare no conflicts of interest.

Acknowledgements
The authors thank Lars Ole Schwen for advice on deterministic implementations of machine learning algorithms and Tim-Rasmus Kiehl for advice on tissue morphology.
The results published here are in whole or part based upon data generated by the TCGA Research Network

Figure 1: Overview of the workflows of both experiments and their interactions with the IDC.

Figure 2: Example tiles of the three classes considered from the TCGA and CPTAC datasets. The width of each tile is 256 µm. The black boxes marked with arrows in the whole-slide images on top show the boundaries of the upper left tiles of the TCGA dataset.

Figure 3: Illustration of the CompPath analysis method. Slides were subdivided into non-overlapping rectangular tiles, discarding those with more background than tissue. Each tile was assigned class probabilities using a neural network performing multi-class classification. Slide-based class values were determined by aggregating the tile-based results.

Figure 4: Generic example of a BigQuery SQL statement for compiling slide metadata. The result set is limited to slide microscopy images, as indicated by the value "SM" of the DICOM attribute "Modality", from the collections "TCGA-LUAD" and "TCGA-LUSC".

Figure 5: One-vs-rest ROC curves for the multi-class classification as obtained in (a) the first run of Experiment 1 using Vertex AI and (b) the first run of Experiment 2 using Colaboratory (T4).

Table 1: Class-specific, slide-based AUC values and 95% confidence intervals (CI) obtained through multiple runs of both experiments.