PRISM: A Platform for Imaging in Precision Medicine

PURPOSE Precision medicine requires an understanding of individual variability, which can only be acquired from large data collections such as those supported by the Cancer Imaging Archive (TCIA). We have undertaken a program to extend the types of data TCIA can support. This, in turn, will enable TCIA to play a key role in precision medicine research by collecting and disseminating high-quality, state-of-the-art, quantitative imaging data that meet the evolving needs of the cancer research community.
METHODS A modular technology platform is presented that would allow existing data resources, such as TCIA, to evolve into a comprehensive data resource that meets the needs of users engaged in translational research for imaging-based precision medicine. This Platform for Imaging in Precision Medicine (PRISM) helps streamline the deployment and improve TCIA's efficiency and sustainability. More importantly, its inherent modular architecture facilitates a piecemeal adoption by other data repositories.
RESULTS PRISM includes services for managing radiology and pathology images and features and associated clinical data. A semantic layer is being built to help users explore diverse collections and pool data sets to create specialized cohorts. PRISM includes tools for image curation and de-identification. It includes image visualization and feature exploration tools. The entire platform is distributed as a series of containerized microservices with representational state transfer interfaces.
CONCLUSION PRISM is helping modernize, scale, and sustain the technology stack that powers TCIA. Repositories can take advantage of individual PRISM services such as de-identification and quality control. PRISM is helping scale image informatics for cancer research at a time when the size, complexity, and demands to integrate image data with other precision medicine data-intensive commons are mounting.


INTRODUCTION
The Precision Medicine Initiative in Oncology is envisioned to "encourage and support . . . new approaches for detecting, measuring, and analyzing a wide range of biomedical information-including molecular, genomic, cellular, clinical, behavioral, physiological, and environmental parameters." 1(p794) Precision medicine requires the ability to classify patients into specialized cohorts that differ in their susceptibility to a particular disease, in the biology of the disease, response to therapy, 2 and so on. Imaging data and, in particular, quantitative imaging features have been identified as a critical source of information when creating such cohorts for precision oncology. Radiomics and pathomics, where quantitative features are extracted from radiology [3][4][5] and digital pathology, 6,7 provide valuable diagnostic and prognostic indicators of cancer. [8][9][10][11][12][13] Identifying such quantitative imaging phenotypes across scale through the use of radiomics, deep learning, and so on also provides an alternative approach to improve our understanding of cancer biology. 14,15 However, these methodologies of leveraging quantitative imaging for clinical and basic research require large collections of well-curated diverse data sets for reproducible development and validation.
Although a growing number of cancer imaging and precision medicine information resources are coming online, [16][17][18] the Cancer Imaging Archive (TCIA) has been the primary resource of the National Cancer Institute (NCI) for acquiring, curating, managing, and distributing images and related data to support cancer research since its creation in 2011. TCIA holds radiology and pathology images collected from more than 46,500 human subjects, together with associated clinical data, image-derived features, and annotations. 19 TCIA also manages a growing number of preclinical image collections, including patient-derived xenograft models. It is visited by approximately 20,000 users per month from approximately 130 countries, exports more than 1 PB of data per year, and has provided data to more than 900 peer-reviewed publications and graduate theses. It is the primary image repository for several NCI programs, [20][21][22][23][24] clinical trials, 25 and various challenges. 20,[26][27][28][29][30]
Even though TCIA has been highly successful, it has some inherent challenges that limit its ability to support the growing field of precision oncology and data sciences. These challenges are not unique to TCIA; they are also observed in institutional data repositories and other large data-sharing activities. In response to these challenges, in 2017 we began work on the Platform for Imaging in Precision Medicine (PRISM). This article summarizes our ongoing developments in PRISM, in particular: novel solutions for managing radiomics and pathomics data sets, managing and integrating clinical data sets, supporting semantic search to ease data discovery, and evolving the curation pipelines to improve throughput. Finally, although TCIA remains the primary driver of PRISM, one of the primary objectives of PRISM is also to modernize and modularize the underlying technology stack so that individual components can be adopted piecemeal.

CHALLENGES
The design and development of PRISM stem from the core premise that well-curated data repositories, with semantically linked collections that permit researchers to integrate information across scale, are essential to cancer imaging and precision medicine research. Simply archiving images is no longer sufficient in today's precision medicine approach to cancer treatment. Researchers have identified the need to analyze integrated data sets consisting of tightly coupled radiology and pathology images with clinical context and features extracted from the images. Through a variety of discussions, TCIA feature requests, surveys, and so on, the following challenges were identified. These challenges have been instrumental in guiding and prioritizing the design and development of PRISM:
• Comprehensive data management and curation to include clinical data, a full range of imaging modalities, pathology images, and radiomic and pathomic features.
• Better tools for curating high-quality data sets at large scales.
• Integration across clinical, radiology, pathology images, and derived feature sets to support queries involving interrelationships between clinical course, response to treatment, and the acquired images and computed features.
• Semantic search that links images, clinical data, and derived features and helps in data discovery and interoperability.
• Tools to encourage data sharing and promote reproducible research.
• A modular architecture that allows piecemeal adoption of capabilities as well as a near-seamless ability to move between cloud and an on-premise deployment.

PRISM
PRISM is taking a systematic approach to address these challenges via a new architectural framework that builds on the principles of microservice architecture and a rich ecosystem of application programming interfaces (APIs). As illustrated in Figure 1, it targets a better modularization of existing software and more efficient incorporation of new services, extensibility, and scalability.
Applications in the top layer may use any of the underlying services to accomplish a task. Multiple applications may perform similar functions but be targeted to different user communities. All functions in the top two layers are interconnected by APIs. In the PRISM architecture, we have chosen to enhance this framework with an API Gateway, 31 which can also deal with user authentication for services. The middle layer includes server-side functions supported by databases and Resource Description Framework triple stores 32 and accessed via the API gateway (except landing pages, wiki, and service desk). Finally, the bottom layer comprises the object stores and external services. The PRISM architecture is explicitly designed to manage data housed in an object store and accessed by standard interfaces such as S3 and OpenStack Object Storage. 33

CONTEXT
Key Objective
Open access information repositories advance cancer research by enabling the creation of new study cohorts and reuse of data to address new research questions. The Cancer Imaging Archive has served as the National Cancer Institute's open image repository for the past decade, and through the Platform for Imaging in Precision Medicine project its technology base and capabilities are being greatly enhanced.
Knowledge Generated
Advanced research into imaging phenotypes and quantitative image analyses in both radiology and pathology is generating a new type of data: image-derived feature sets. The tools we are developing for semantic integration of clinical and quantitative image data across scale will enable new research directions and support advanced machine learning algorithm development.
Relevance
Quantitative imaging and omics data (eg, radiogenomics) are proving to be essential new tools to advance our understanding of cancer mechanisms and improve our ability to diagnose and track response to cancer therapy. State-of-the-art, open-access information repositories are essential to enable these techniques to produce actionable clinical knowledge.

Image and Feature Management
The design and development of PRISM are driven by the need to support "image-omic studies," a research design that involves the integration of clinical data, imaging data, quantitative features extracted from the images, and molecular data. Such studies enable a highly data-driven approach to diagnosis and outcome prediction 34 and are a key component of precision medicine. Indeed, many research groups have developed methods for linked characterizations of imaging features, clinical outcome, and omics signatures and studied their relevance in clinical research. [6][7][8][9][10][11][12][35][36][37][38][39][40][41][42][43][44][45][46][47] Locating and accessing data cohorts with the relevant information requires that, besides imaging metadata, any associated clinical and demographic data be indexed and part of the data query process. Although it would be desirable to index and search across imaging features, it is difficult to harmonize features and make them part of the query process. It is much easier to index the availability of features and their provenance, so users can make that information part of the query process. However, imaging features must be part of a data cohort. To maintain linkages across the various data types and manage the data across multiple collections, PRISM builds on the TCIA data model, as shown in Figure 2.
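As a rough illustration of indexing the availability and provenance of features, rather than the feature values themselves, the sketch below models collections of subjects and selects a cohort by clinical criteria and by which kinds of feature sets exist. All class names, field names, and example values (`FeatureSetRecord`, `algorithm`, `demo-lung`, and so on) are hypothetical and are not taken from the PRISM or TCIA schemas.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSetRecord:
    """Index entry: records that a feature set exists and where it came from."""
    kind: str        # hypothetical values: "radiomic" or "pathomic"
    algorithm: str   # provenance: method that produced the features
    version: str     # provenance: version of that method


@dataclass
class Subject:
    subject_id: str
    clinical: dict = field(default_factory=dict)
    feature_sets: list = field(default_factory=list)


@dataclass
class Collection:
    name: str
    subjects: list = field(default_factory=list)


def find_cohort(collections, clinical_filter, required_kinds):
    """Return (collection name, subject id) pairs that match every clinical
    criterion and have every required kind of feature set available."""
    cohort = []
    for coll in collections:
        for subj in coll.subjects:
            clinical_ok = all(
                subj.clinical.get(key) == value
                for key, value in clinical_filter.items()
            )
            available = {fs.kind for fs in subj.feature_sets}
            if clinical_ok and set(required_kinds) <= available:
                cohort.append((coll.name, subj.subject_id))
    return cohort


# Hypothetical example data: two subjects, only one has radiomic features.
demo = [
    Collection("demo-lung", [
        Subject("S1", {"diagnosis": "lung adenocarcinoma"},
                [FeatureSetRecord("radiomic", "example-extractor", "1.0")]),
        Subject("S2", {"diagnosis": "lung adenocarcinoma"}),
    ])
]
print(find_cohort(demo, {"diagnosis": "lung adenocarcinoma"}, ["radiomic"]))
```

The query never inspects feature values, only whether a feature set of the requested kind has been registered, which mirrors the trade-off described above.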
Image data management. The PRISM data model organizes data as collections. A collection typically includes studies from several subjects (patients), and each subject has data of multiple data types, such as radiology and pathology images, radiomic and pathomic features, and clinical data. Radiology image data are represented as Digital Imaging and Communications in Medicine (DICOM) objects and are managed using the open-source National Biomedical Imaging Archive (NBIA) software package. 48 NBIA functions as an application layer that sits over a MySQL relational database. PRISM is expanding the radiology data management capabilities and adding support for the new DICOMweb 49 representational state transfer (REST or RESTful) APIs. The use of such standardized APIs will allow the adoption of off-the-shelf DICOM viewers and directly query and retrieve DICOM data.
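To make the role of the DICOMweb REST interface concrete, the following sketch builds the kind of URLs an off-the-shelf viewer would use for study search (QIDO-RS) and series retrieval (WADO-RS). The base URL, patient identifier, and helper function names are hypothetical; the sketch only constructs request URLs and does not contact any server or reflect how a specific PRISM deployment exposes its endpoints.

```python
from urllib.parse import quote, urlencode

# Hypothetical DICOMweb root; a real deployment publishes its own base URL.
BASE_URL = "https://example.org/dicomweb"

# Media type commonly requested for QIDO-RS search results.
QIDO_ACCEPT = "application/dicom+json"


def qido_studies_url(base, **filters):
    """Build a QIDO-RS study search URL, e.g. filtered by PatientID."""
    query = urlencode(filters)
    return f"{base}/studies?{query}" if query else f"{base}/studies"


def wado_series_url(base, study_uid, series_uid):
    """Build a WADO-RS URL that retrieves every instance in one series."""
    return f"{base}/studies/{quote(study_uid)}/series/{quote(series_uid)}"


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    print(qido_studies_url(BASE_URL, PatientID="DEMO-0001", limit=10))
    print(wado_series_url(BASE_URL, "1.2.3", "1.2.3.4"))
```

Because both URL shapes are defined by the DICOMweb standard, a client written against them should not need repository-specific code.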
Unlike radiology, there are no common standards for pathology image data. Therefore, PRISM includes PathDB, a pathology data management system that manages and organizes whole slide images and pathomic features and the provenance of the features. Included with PathDB is a web application called FeatureMap. FeatureMap allows users to view and interact with feature maps. A feature map is a composite representation in the form of a low-resolution image of one or more classification probability maps; probability maps are generated on whole-slide tissue images by deep learning methods. 50 Access control for image data and associated nonimaging data, such as features and any available clinical data, is managed at the collection level. If a user has access to a particular collection, then all data under that collection are also made available. User access information is
Radiomics and pathomics features. FeatureBase is responsible for storing and indexing large volumes of imaging features so that user-facing query and visualization applications can efficiently interact with them. Pathomic features can include individual segmented nuclei/cells and their morphology as well as features indicating patterns and the likelihood of macro structures, such as lymphocyte patterns, or characterization of tumoral and stromal regions. Pathomics data sets can become very large. For example, segmenting nuclei in a data set of 1,000 images can easily generate more than a billion segmented objects and tens of billions of imaging features. To address the complexity and scale of pathomics data, PRISM has adapted the FeatureDB service of QuIP 52 to implement FeatureBase.
Although FeatureBase was developed to support pathomics features, there is significant overlap between pathomic and radiomic features and in how researchers interact with them. FeatureBase can index individual objects and store them as polygons, whereas features computed for segmented objects are stored as feature vectors, spatial patterns, or probability maps. A probability map partitions an image into a uniform mesh of image patches. Each image patch is assigned a probability value (by a machine/deep learning method), which indicates the probability of the image patch belonging to a class (eg, grade 3 tumor). For pathomic features, the various imaging features are represented as GeoJSON-compliant JSON documents that are then managed and indexed in a MongoDB database. Unlike pathology, in radiology, the DICOM community has standards for representing segmentations and probability maps, as well as structured representations of computed features. Therefore, instead of using GeoJSON, we are adopting DICOM standards for representing radiomic features while still indexing and managing them in MongoDB. The use of a shared environment for radiomic and pathomic features is expected to improve linkages between radiomics and pathomics data for integrated exploration and analysis.
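The GeoJSON representation of a pathomic probability map described above might look like the following sketch, in which each image patch becomes a GeoJSON polygon carrying its class probability. The function name, the property keys (`label`, `probability`), and the use of pixel coordinates as GeoJSON positions are assumptions made for illustration; the actual FeatureBase document schema is not specified here.

```python
import json


def probability_map_to_geojson(probabilities, patch_size, label):
    """Convert a 2D grid of patch probabilities into a GeoJSON FeatureCollection.

    probabilities[row][col] is the probability that the patch at that mesh
    position belongs to `label`; patch_size is the patch edge length in pixels.
    """
    features = []
    for row, row_values in enumerate(probabilities):
        for col, probability in enumerate(row_values):
            x0, y0 = col * patch_size, row * patch_size
            x1, y1 = x0 + patch_size, y0 + patch_size
            # A GeoJSON linear ring must be closed: first and last positions match.
            ring = [[x0, y0], [x1, y0], [x1, y1], [x0, y1], [x0, y0]]
            features.append({
                "type": "Feature",
                "geometry": {"type": "Polygon", "coordinates": [ring]},
                # Hypothetical property names, not the FeatureBase schema.
                "properties": {"label": label, "probability": probability},
            })
    return {"type": "FeatureCollection", "features": features}


if __name__ == "__main__":
    # A 1 x 2 mesh of 256-pixel patches with made-up probabilities.
    document = probability_map_to_geojson([[0.1, 0.9]], 256, "grade 3 tumor")
    print(json.dumps(document, indent=2))
```

Documents of this shape are plain JSON, so they can be stored and indexed in a document database in the way the text describes for MongoDB.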

MAKING THE DATA FAIR
The stewardship of image data needs to adhere to the FAIR (findable, accessible, interoperable, reusable) principles, 53 to achieve its full potential as a scientific resource. This is a key design tenet of PRISM. PRISM-based resources have to be agile to meet the changing needs and technologies that are in use by the community, such as the increasing reliance on REST APIs and advanced computational statistics engines to support programmatic interoperability at scale. In particular, data assets produced and consumed by image analysis need to be available as components of an "API ecosystem" as part of the overarching normalization of Research Data Commons. 54 The fluid nature of these new software engineering environments comes with its own challenges, such as the need for continuous API design and distributed authorization. 55

Findable: Semantic Integration and Search
Semantic integration in PRISM aims to make image collections and associated nonimage data more findable, accessible, interoperable, and reusable. Our approach goes beyond the specific need to make data findable by also addressing the underlying challenge of integrating and managing diverse nonimage data associated with image collections. PRISM integrates and manages nonimage data using ontology-based representation patterns that account for explicit and implicit connections among the data across the source data sets. 56 Instances in the data are linked to ontology classes that define and represent the entities that the data are about (eg, anatomic locations, disease types, diagnosis). The Open Biomedical Ontologies (OBO) Foundry 57 is a collection of axiomatically rich ontologies adhering to common design principles and using a consistent shared representational strategy based on Basic Formal Ontology 58 to achieve interoperability across subject areas. OBO ontologies are available for reuse under a permissive license (CC BY 4.0). PRISM uses many OBO resources, including the Human Disease Ontology, 59 the Ontology for Biomedical Investigations, 60 and the Uber Anatomy Ontology (Uberon). 61 Work is ongoing to develop ontology-driven semantic search tools that make use of the representations underlying our semantic integration efforts. Richer user-facing tools for search and exploration of nonimage data in image collections will allow queries across collections that combine demographics, tumor location, disease types, and other similar data. We have developed a proof-of-concept query interface that allows users to identify records matching criteria on the basis of fields in nonimage data that were previously not queryable; for instance, finding records across head and neck cancer collections for male patients older than 55 years of age with a positive HPV diagnosis and a primary tumor in the oropharynx.
Figure 3 illustrates the ontology-driven semantic search strategy, in which a simple search interface populated using ontologies and linked instances generates SPARQL queries to search the ontology-linked nonimage data (stored in a triple-store database), as well as structured query language (SQL) queries for image metadata stored in a relational database. The results link directly to downloadable/viewable images from matching records. ARIES (Arkansas Image Enterprise System), 62 a PRISM instance hosting neuroimaging data for University of Arkansas for Medical Sciences researchers, provides an early testbed to deploy and refine the PRISM approach to semantic integration.
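To illustrate how a simple search form could be turned into a SPARQL query, the sketch below assembles a query string for the head and neck example mentioned earlier. The `ex:` namespace, the class `ex:ClinicalRecord`, and the property names (`sex`, `hpvStatus`, `primaryTumorSite`, `ageInYears`) are hypothetical placeholders; a PRISM deployment would use terms from its OBO-based representation, most likely ontology class IRIs rather than the plain string literals used here.

```python
def sparql_literal(value):
    """Render a Python value as a SPARQL literal, escaping quotes in strings."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_cohort_query(equals, min_age=None):
    """Build a SPARQL SELECT query from equality criteria and an optional
    lower age bound. Property names are assumed to come from a trusted,
    fixed form definition, not from free-text user input."""
    lines = [
        "PREFIX ex: <http://example.org/prism#>",  # hypothetical namespace
        "SELECT ?record WHERE {",
        "  ?record a ex:ClinicalRecord .",
    ]
    for prop, value in equals.items():
        lines.append(f"  ?record ex:{prop} {sparql_literal(value)} .")
    if min_age is not None:
        lines.append("  ?record ex:ageInYears ?age .")
        lines.append(f"  FILTER(?age > {int(min_age)})")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_cohort_query(
        {"sex": "male", "hpvStatus": "positive", "primaryTumorSite": "oropharynx"},
        min_age=55,
    ))
```

The resulting string would be sent to the triple store, while a parallel SQL query against the relational database would resolve the matching records to image metadata.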

Accessible: Visualization and Data Exploration Apps
PRISM includes a variety of user-facing web applications that allow researchers to explore a repository and create and examine cohorts. Web applications enabled by the modern browser have the advantage of being assembled in the browser's sandbox, which comes with significant advantages when operating cloud resources safely. Such web applications, often described as progressive web apps, are an ideal environment to engage PRISM's APIs to drive the various web viewers and data exploration tools.
PRISM now includes the Open Health Imaging Foundation viewer 63 for visualizing radiology objects and the caMicroscope viewer 50,52 for visualizing digital pathology images. These viewers interface with the respective image management systems (Fig 4). A high-speed bulk download mechanism is available to help users reliably download large amounts of radiology data. A similar mechanism to support the download of pathology data is under development. For interactive data exploration, a suite of task-specific data portals, such as the Clinical Proteomic Tumor Analysis Consortium Pathology Portal, 64 have been built using a declarative visualization tool called DataScope. 65 These provide the foundation for a series of generic data exploration environments that are being built and will be released in the coming months as part of the PRISM tech stack. Finally, the accessibility to PRISM-managed data, via APIs, has allowed third parties to develop integrations with research frameworks such as BioConductor, 66 third-party applications such as 3DSlicer, 67 and data science environments such as Jupyter notebooks. 68
[Figure 3 labels: simple search interface; SPARQL queries; triple store (graph DB with ontologies and instance data); SQL (image metadata); search results; image viewer]

Interoperable: Data Curation
Careful curation and strict quality control have been instrumental to the success of TCIA. PRISM builds on the TCIA experience and includes tools that are capable of curating diverse data sets at large scales. The modular design of PRISM allows us to disseminate these capabilities as standalone modules that can be used to drive individual research imaging repositories. This includes dissemination of knowledge to the wider research community in areas of DICOM de-identification 69 and open data. 70 PRISM is adopting and modernizing the suite of advanced tools, procedures, and scalable workflows for semiautomated data curation, quality control, and enhancement that have allowed the repository to grow continuously. Data curation in PRISM uses the Posda tool suite 71 to implement its curation workflows. Posda is a set of curation workflow tools developed to ensure the scientific utility of data, to eliminate protected health information, and to improve the scalability of the curation workflow. Posda supports a single curation pipeline that handles all object types defined by the DICOM standard (images, radiation therapy objects, structured reports, segmentations, and so on). This pipeline performs integrity checks automatically on a bulk basis, applies revisions to data sets, tracks all changes in a revision tracker that permits rollback if needed, and rapidly identifies potential duplicate data sets on the basis of stored hash codes, without identifying the individual.
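The idea of finding potential duplicates from stored hash codes, without inspecting who the data belong to, can be sketched as follows. SHA-256 and the helper names are choices made for this example only; the text does not state which hash function Posda uses or exactly which bytes it hashes.

```python
import hashlib
from collections import defaultdict


def content_digest(payload: bytes) -> str:
    """Return a hex digest of the payload (SHA-256 is an assumption here)."""
    return hashlib.sha256(payload).hexdigest()


def find_duplicates(objects):
    """Group object identifiers whose content digests are identical.

    `objects` maps an object identifier to its raw bytes. Only digests are
    compared, so no patient-level information is needed to flag duplicates.
    """
    by_digest = defaultdict(list)
    for object_id, payload in objects.items():
        by_digest[content_digest(payload)].append(object_id)
    return [sorted(ids) for ids in by_digest.values() if len(ids) > 1]


if __name__ == "__main__":
    # Made-up payloads standing in for stored objects.
    sample = {"obj-a": b"pixel-data-1", "obj-b": b"pixel-data-2", "obj-c": b"pixel-data-1"}
    print(find_duplicates(sample))
```

In practice the digests would be computed once at ingest and stored, so duplicate checks reduce to a lookup rather than a rescan of the data.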
PRISM is extending Posda with new workflows to support pathology and pathomic features. Curation tools are being interfaced with semantic integration and ontology toolkits as new Posda pipelines and curation procedures. The overarching objective of curation is to ensure compliance with the requirements governing disclosure of protected health information and to ensure that data formats are reusable and carry enough semantic metadata for researchers to unambiguously find the data they need.

Reusable: Digital Object Identifiers
To incentivize data sharing and promote research reproducibility, many publishers now encourage authors to provide data citations. PRISM leverages the popular Digital Object Identifier (DOI) management system called DataVerse 72 for "publishing" user-generated results and issuing and managing DOIs. DOIs are well-recognized mechanisms to make the provided data unique, persistent, and citable. 73 DataVerse is being integrated with FeatureBase to better support image-omic features and the various other data-management systems. The metadata schema used by DataVerse allows PRISM to include attributes that facilitate versioning and others that capture the relationships between the data set being registered and related publications/data sets.

OPERATING AT SCALE: THE PRISM TECH STACK
TCIA was originally implemented as a collection of mirrored and load-balanced virtual machines (VMs) and shared bulk storage for all of the VMs. This has allowed TCIA to maintain a 99.5% uptime. The main drawback of using VMs is that the collection of systems making up TCIA is difficult to deploy and requires intimate knowledge of the interconnections between systems to keep TCIA updated and running. More importantly, the tech stack is tightly coupled, which makes it difficult to distribute individual capabilities and adopt them piecemeal.
In PRISM, the tech stack is being modularized and driven as a set of RESTful web services, including data services, that interface with data stored on a modern object storage system. These services are accessed via APIs that are made available through a centralized API gateway. Additional core services, such as load balancers and centralized security services, are also made available. PRISM will rely on Kubernetes, 74 an orchestrated container management environment where the interconnections and interfaces between containers making up subsystems, as well as the interconnections between subsystems, are automatically configured using scripts. This simplifies deployment and maintenance of PRISM-based sites regardless of whether the sites are hosted locally on dedicated hardware or in virtualized or cloud-based environments.
All PRISM components developed by our team are released open source under the BSD 3-Clause "New" or "Revised" License or the Apache 2.0 License. Available examples include the Posda curation toolkit, 75 the QuIP Pathology and pathomics management services, 76 and the caMicroscope pathology viewer. 77 Additional modules are similarly distributed.
Components such as the Kubernetes orchestration software and API gateway 78 are open-source tools developed by others.
In conclusion, realizing the promise of precision medicine in enabling better treatment strategies for cancer, a complex multifactorial disease state, will largely depend on how well we synthesize information across multiple scales from the patient down to the molecular level. Today, treatment strategies are often developed by gleaning information through qualitative and subjective interpretations of images combined with molecular characterizations and clinical data. Although molecular characterizations inform prognosis and targeted therapy decisions, image information is a crucial component in the overall decision-making process. Radiomics and pathomics studies provide highly detailed, quantitative, and reproducible descriptions and characterizations of tumor structure and function at complementary biologic scales. The complexity and sizes of primary and derived data sets in radiomics and pathomics dictate scalable and extensible software infrastructures to curate, manage, and share said data sets. PRISM provides capabilities that allow researchers to address these issues of data management and integration, thus allowing them to quantitatively incorporate imaging data. These capabilities will enable the cancer research community to synthesize information across multiple scales, a key tenet of precision medicine for cancer.
Consider a research team studying lung cancer. A PRISM-based repository will allow the team to use semantic query capabilities to pool data from multiple collections to create the requisite cohort of, say, patients with lung adenocarcinoma, with linkages across various images, features, feature provenance, and molecular characteristics. The research team can manage, explore, and refine results from their analyses within their collaboration. They will be able to upload their analysis results and images to the community PRISM instance if they would like to share them with the research community at the completion of their study.
No other potential conflicts of interest were reported.

ACKNOWLEDGMENT
The authors thank the entire PRISM team for their many contributions.