NCI’s Proteomic Data Commons: A Cloud-Based Proteomics Repository Empowering Comprehensive Cancer Analysis through Cross-Referencing with Genomic and Imaging Data

Abstract
Proteomics has emerged as a powerful tool for studying cancer biology and for developing diagnostics and therapies. With the continuous improvement and widespread availability of high-throughput proteomic technologies, the generation of large-scale proteomic data has become more common in cancer research, and there is a growing need for resources that support the sharing and integration of multi-omics datasets. Such datasets require extensive metadata, including clinical, biospecimen, experimental, and workflow annotations, that are crucial for data interpretation and reanalysis. The need to integrate, analyze, and share these data has led to the development of NCI’s Proteomic Data Commons (PDC), accessible at https://pdc.cancer.gov. As a specialized repository within the NCI Cancer Research Data Commons (CRDC), PDC enables researchers to locate and analyze proteomic data from various cancer types and connect with genomic and imaging data available for the same samples in other CRDC nodes. Presently, PDC houses annotated data from more than 160 datasets across 19 cancer types, generated by several large-scale cancer research programs with cohort sizes exceeding 100 samples (tumor and associated normal when available). In this article, we review the current state of PDC in cancer research, discuss the opportunities and challenges associated with data sharing in proteomics, and propose future directions for the resource.

Significance: The Proteomic Data Commons (PDC) plays a crucial role in advancing cancer research by providing a centralized repository of high-quality cancer proteomic data, enriched with extensive clinical annotations. By integrating and cross-referencing with complementary genomic and imaging data, the PDC facilitates multi-omics analyses, driving comprehensive insights and accelerating discoveries across various cancer types.


Introduction
Precision medicine aims to tailor medical treatment to the unique characteristics of individual patients, with the goal of improving outcomes and reducing healthcare costs. One key challenge in precision medicine is the need for large, high-quality datasets that can be used to identify biomarkers and other molecular features associated with disease susceptibility, progression, and response to treatment. Toward this goal, large consortia, such as The Cancer Genome Atlas (ref. 1) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC; ref. 2), have generated vast amounts of multi-omics data from thousands of patient samples across multiple cancer types. Historically, several proteomic repositories, such as PRIDE (3), PeptideAtlas (4), and MassIVE (University of California San Diego, USA, in 2014), under the umbrella of ProteomeXchange (5), an international consortium, have supported data sharing in proteomics. However, there is a growing need for resources that support the sharing and integration of multi-omics datasets with extensive clinical annotations, with emphasis on data reuse. This necessitates a resource that encompasses data from thousands of patient samples across various national and international programs, covering multiple cancer types and stages. The resource should (1) (10). By integrating the PDC resource with other data commons efforts, we aim to accelerate the pace of discovery in cancer research and improve the lives of patients with cancer.
In this article, we describe the design and implementation of the PDC, highlight its key features, and describe how it interoperates with other resources within the NCI's cancer ecosystem, the CRDC.

Author affiliations: 1ICF, Rockville, Maryland. 2Georgetown University, Washington, District of Columbia. 3Spectragen Informatics LLC, Bainbridge Island, Washington. 4University of Washington, Seattle, Washington. 5Leidos Biomedical, Inc., Rockville, Maryland. 6Center for Biomedical Informatics & Information Technology, National Cancer Institute, Rockville, Maryland. 7Office of Cancer Clinical Proteomics Research, National Cancer Institute, Rockville, Maryland.

Data model and data dictionaries
We have developed a robust data model and data dictionary to ensure effective

Design and development
PDC has been built on the Amazon Web Services cloud platform to take advantage of its benefits in terms of scalability, security, accessibility, and collaborative research. The infrastructure is Federal Information Security Management Act compliant, with stringent access controls, encryption protocols, regular audits, and incident response procedures that protect the confidentiality, integrity, and availability of data against potential threats and vulnerabilities.
During the initial design phase in 2017 to 2018, a minimum viable product was created with two key components to support data submission and data distribution. A variety of proteomic data publicly available from the CPTAC program was populated during the design phase to serve as testing data. The minimum viable product was made available to the public, and the feedback collected was used to improve the system during the actual build phase, which was officially launched in March 2020.

Data harmonization
By adhering to the principles of Findability, Accessibility, Interoperability, and Reusability (FAIR) of data (ref.

Common data analysis pipeline
The CDAP was initially developed to encourage the reuse of the data ac-
The available data types in PDC (Table 1) include raw MS files in proprietary (vendor) format as submitted by the authors and several outputs from CDAP processing: raw data transformed to HUPO PSI mzML format, PSM information in tab-separated and HUPO PSI mzIdentML formats, summary reports for proteins and PTM sites, and quality metrics files reporting statistics of the MS/MS spectra in the data files.
Quantitative information in the PSM and protein reports contains the spectrum-level or gene-level ("rolled-up") precursor peak areas and spectral counts for label-free experiments, or reporter ion log-ratios for labeled multiplexing experiments.
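The relationship between spectrum-level and gene-level values can be illustrated with a minimal Python sketch. It is not the CDAP implementation: the record layout (`gene`, `precursor_area`, `reporter_intensities`), the choice of the median for rolling up log-ratios, and the assumption that all intensities are positive are ours, chosen only to make the idea concrete.

```python
import math
from collections import defaultdict
from statistics import median


def label_free_rollup(psms):
    """Sum precursor peak areas and count spectra per gene (label-free data)."""
    areas = defaultdict(float)
    counts = defaultdict(int)
    for psm in psms:
        areas[psm["gene"]] += psm["precursor_area"]
        counts[psm["gene"]] += 1
    return dict(areas), dict(counts)


def reporter_log_ratios(intensities, reference_channel):
    """log2(channel / reference) for every non-reference channel of one PSM."""
    reference = intensities[reference_channel]
    return {
        channel: math.log2(value / reference)
        for channel, value in intensities.items()
        if channel != reference_channel
    }


def gene_log_ratios(psms, reference_channel):
    """Roll PSM-level log-ratios up to gene level (median is an assumption)."""
    per_gene = defaultdict(lambda: defaultdict(list))
    for psm in psms:
        ratios = reporter_log_ratios(psm["reporter_intensities"], reference_channel)
        for channel, ratio in ratios.items():
            per_gene[psm["gene"]][channel].append(ratio)
    return {
        gene: {channel: median(values) for channel, values in channels.items()}
        for gene, channels in per_gene.items()
    }


# Hypothetical records, for illustration only.
label_free_psms = [
    {"gene": "TP53", "precursor_area": 1.5e6},
    {"gene": "TP53", "precursor_area": 0.5e6},
    {"gene": "EGFR", "precursor_area": 3.0e6},
]
labeled_psms = [
    {"gene": "TP53", "reporter_intensities": {"126": 200.0, "127N": 50.0, "131": 100.0}},
    {"gene": "TP53", "reporter_intensities": {"126": 800.0, "127N": 25.0, "131": 100.0}},
]

areas, counts = label_free_rollup(label_free_psms)
gene_ratios = gene_log_ratios(labeled_psms, reference_channel="131")
```

In this sketch the reference channel plays the role of the common denominator within one multiplexed experiment, so every reported value is relative to that reference rather than an absolute abundance.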
In addition to the CDAP-harmonized data, PDC also encourages authors to submit the processed data discussed in their publication. These data are accessible as supplementary information on the study summary pages and through the PDC publications page. The latter catalogs articles from available studies in PDC and offers convenient access to the processed data described in those articles. This page is designed to assist researchers who often discover data through research papers.
In addition to data files, the PDC annotates the CDAP protein reports with rich clinical data and offers the ability to visualize protein quantitation data using Morpheus, a data analysis and matrix visualization tool (https://software.broadinstitute.org/morpheus), for exploring relative quantitation data as heatmaps. Users can cluster data based on the clinical annotations, generate new annotations, and interact with features such as searching, filtering, and sorting using gene names and sample identifiers, along with displaying charts and other functionalities. Individual PDC study summary pages provide links to the heatmaps when available. For easy access, all the heatmaps are also available through a dedicated analysis page for quantitative data exploration.
PDC data are also accessible for in-depth analysis through specialized tools like PepQuery (28) 2).

Study identifiers and versioning
PDC ensures data reliability and traceability through persistent identifiers and versioning. Persistent identifiers help in citing data, whereas versioning tracks changes over time, enabling researchers to access specific dataset versions for reproducible analyses. Older versions remain accessible if changes are related to advancements in analysis methods or if their availability is deemed necessary for data reuse.

download information via manifest files and APIs (Fig. 3). Whereas PDC offers curated data, NCI Cloud Resources (CRs) provide cloud-based computational infrastructure and analysis tools (30). PDC enables interoperability with CRs, allowing users to identify and transfer data seamlessly for analysis on platforms like Seven Bridges Genomics Cancer Genomics Cloud (31) or Broad Institute's FireCloud (bioRxiv 209494). Additionally, PDC APIs integrate with ISB-CGC (32), providing access to quantitative data alongside genomic data on the Google BigQuery infrastructure. This integration empowers users to perform multi-omics analyses by combining protein quantitation with complementary genomic data within the CRDC ecosystem. More details on this integration and usage can be found on the PDC portal.
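As an illustration of how a manifest file can support reproducible reuse, the following Python sketch checks downloaded files against the checksums listed in a manifest. The manifest layout is an assumption made for this example: a CSV file with hypothetical `file_name` and `md5sum` columns. The actual PDC manifest columns should be taken from the PDC documentation.

```python
import csv
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, download_dir):
    """Return {file_name: 'ok' | 'missing' | 'mismatch'} for each manifest row.

    Assumes hypothetical CSV columns 'file_name' and 'md5sum'.
    """
    results = {}
    with open(manifest_path, newline="") as handle:
        for row in csv.DictReader(handle):
            local_file = Path(download_dir) / row["file_name"]
            if not local_file.exists():
                results[row["file_name"]] = "missing"
            elif md5_of(local_file) == row["md5sum"].strip().lower():
                results[row["file_name"]] = "ok"
            else:
                results[row["file_name"]] = "mismatch"
    return results
```

A call such as `verify_manifest("manifest.csv", "downloads/")` (both paths hypothetical) would flag files that are absent or corrupted before an analysis starts.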

Data submission portal
The PDC provides extensive documentation of the resource, covering its data model, its various features, its APIs, and the bioinformatics pipelines used to harmonize the data (Table 2).

Discussion
Over the past decade, data sharing has transformed from a recommended practice to a mandate by many scientific journals and funding organizations to ensure the accessibility of high-value datasets and accelerate the pace of biomedical research. This has led to significant volumes of MS-based proteomic data being freely available in the public domain (5). However, the effective use, reuse, reprocessing, and repurposing of these complex data for new discoveries, especially in cancer research, are contingent upon the presence of highly curated and standardized metadata and clinical annotations. The PDC has been established to address this gap. In this article, we described a comprehensive data resource that distributes MS-based cancer-related proteomic data from large cancer proteogenomic programs and facilitates proteogenomic integration through interoperability with other data commons and analytical resources within the NCI CRDC. The resource serves both as a data repository and as a knowledge base. The extensive curation of the biospecimen, clinical, and proteomic metadata provides an intuitive cohort exploration using biospecimen, clinical, gene, data, and proteomic attributes and facilitates combining data from diverse sources for meta-analysis. We have recently expanded the PDC data model and dictionaries to incorporate MS-based metabolomic and lipidomic data alongside the complementary proteomic data. Although metabolomics and lipidomics are distinct omics types, the PDC data model has been adjusted to accommodate these differences. This update now allows PDC to host comprehensive MS-based multi-omics datasets generated for the same cohorts from the CPTAC consortium, enhancing the integration and utility of diverse data types.
To promote the reuse of data, we designed the data model and dictionaries to align with existing resources and community standards such as the cancer Data Standards Registry and Repository and HUPO PSI. This approach simplifies the CRDC's efforts to create a common data model that users can search using variables such as participant, sample, tissue, disease, or race. This aggregation also makes it possible for researchers to create complex multistudy datasets from both open- and controlled-access datasets across the CRDC that can be used for integrative analysis.
Despite the robust capabilities of the PDC, several challenges remain. The extensive metadata requirements, while ensuring data reliability and reusability, impose a significant burden on both data submitters and PDC data managers. Ensuring the completeness and accuracy of data relies heavily on the submitters' expertise, necessitating continuous updates to our data dictionaries and comprehensive training resources, which is resource-intensive for the PDC. Additionally, strict enforcement of standards can delay submissions but is crucial for maintaining data quality. The integration of multimodal data presents unique challenges because of staggered releases and varying identifiers across repositories, adding complexity to data management within the PDC.
The subject matter expertise needed to manage the continuously evolving data standards in clinical, MS, proteomics, and metabolomics domains is critical. Adapting to these evolving standards and retrospectively updating historical data to align with new standards further complicate data management and require substantial effort and specialized knowledge. Through a suite of core standards and services, the CRDC is exploring ways to streamline the submission process, improve data interoperability, and develop tools to facilitate seamless data integration (7).
The lack of a common reference sample across proteomic datasets in the PDC is another significant challenge, as it hinders the ability to directly compare protein abundance ratios across different cancer types for the same gene, even though all data are analyzed through a CDAP. This discrepancy complicates cross-study analyses and limits the ability to draw consistent, meaningful biological conclusions. The CPTAC program is working on ways to better understand and enable experiment bridging that will facilitate such comparisons and informative visualizations.
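A small numerical sketch in Python shows why ratios measured against different references cannot be compared directly. All numbers are invented, and the bridging step shown at the end is a generic illustration of the idea, not the method under development by CPTAC.

```python
import math


def log_ratio(sample_intensity, reference_intensity):
    """log2 abundance ratio of a sample relative to a study's reference."""
    return math.log2(sample_intensity / reference_intensity)


# Hypothetical case: the same protein has the SAME true abundance in a tumor
# from study A and a tumor from study B, but each study uses its own
# reference pool with a different abundance of that protein.
tumor_a, reference_a = 400.0, 100.0
tumor_b, reference_b = 400.0, 200.0

ratio_a = log_ratio(tumor_a, reference_a)
ratio_b = log_ratio(tumor_b, reference_b)

# The apparent difference between studies is entirely a reference offset.
reference_offset = math.log2(reference_b / reference_a)
apparent_difference = ratio_a - ratio_b

# If one bridging sample were measured in both studies, the offset could be
# estimated from it and removed (illustrative only).
bridge = 300.0
estimated_offset = log_ratio(bridge, reference_a) - log_ratio(bridge, reference_b)
ratio_b_bridged = ratio_b + estimated_offset
```

Here the two tumors differ by one log2 unit only because the references differ; without a shared or bridging sample, that offset is unknown and cannot be separated from real biology.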
Recent advancements in artificial intelligence technologies offer great potential to achieve the National Cancer Plan's goal of maximizing data utility for faster progress against cancer. The PDC is working with the CRDC to lay the foundation for Artificial Intelligence Data Readiness by adhering to FAIR principles, mandating submission requirements, and encouraging data submitters to ensure the accuracy, completeness, consistency, and validity of the data.
Researchers can use the PDC via its interactive portal at https://pdc.cancer.gov, engage with it through APIs, and access its data through other platforms within the NCI CRDC.
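For readers who prefer programmatic access, the following Python sketch shows the general shape of a GraphQL request that asks only for the fields it needs. The endpoint URL, the `filesPerStudy` query name, its `pdc_study_id` argument, and the returned field names are assumptions made for illustration and should be checked against the PDC API documentation; the study identifier in the usage note is a placeholder.

```python
import json
import urllib.request

# Assumed endpoint; confirm against the PDC API documentation.
PDC_GRAPHQL_URL = "https://pdc.cancer.gov/graphql"


def build_files_query(pdc_study_id):
    """Build a GraphQL payload requesting only three fields per file.

    Query name, argument, and field names are assumptions, not a
    confirmed description of the PDC schema.
    """
    query = """
    query FilesPerStudy($study: String!) {
      filesPerStudy(pdc_study_id: $study) {
        file_name
        file_type
        md5sum
      }
    }
    """
    return {"query": query, "variables": {"study": pdc_study_id}}


def run_query(payload, url=PDC_GRAPHQL_URL, timeout=30):
    """POST a GraphQL payload as JSON and return the decoded response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call would look like `run_query(build_files_query("PDC000000"))`, where `PDC000000` stands in for a real study identifier. Because the query names its fields explicitly, the response carries only those fields, which is the behavior that distinguishes GraphQL from a fixed REST response.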
facilitate integration of proteomic, genomic, and other omics data to advance the discovery of new biomarkers and therapeutic targets, and (2) provide comprehensive annotation with clinical metadata, allowing researchers to delve into the connections between molecular features and clinical outcomes. With this objective in mind, the NCI has established the Cancer Research Data Commons (CRDC; refs. 6, 7), a cloud-based data science infrastructure that provides secure access to a large, comprehensive, and expanding collection of cancer research data and analytical tools. The Proteomic Data Commons (PDC) is a key component of this effort, providing a central repository for proteomic data that can be easily accessed and analyzed by researchers across the cancer research community (Fig. 1). Along with other data nodes such as the Genomic Data Commons (GDC; ref. 8) and Imaging Data Commons (IDC; ref. 9), the PDC provides a comprehensive repertoire of cancer proteomic data organization and standardized representation of the proteomic data. The PDC data model serves as a structured framework for capturing and organizing diverse types of proteomic information, including experimental metadata, biospecimen and clinical information, raw data, and analytical results. The data are represented as various entities such as administrative (program and project), biospecimen hierarchy (case, sample, and aliquot), clinical (demographic, diagnosis, follow-up, treatment, exposure, and family history), experimental design, protocol, and file metadata. The data model ensures consistency and facilitates data harmonization across different datasets, allowing researchers to easily compare and analyze proteomic data from multiple sources. It also outlines the complex sample-to-data-file relationships resulting from sample multiplexing and fractionation that are common in proteomic experiments. The accompanying data dictionary provides a comprehensive guide to the definitions, formats, and conventions used within the PDC data model for each of the data elements for the entities. It serves as a reference for data contributors and users, ensuring clear and standardized data representation. The data dictionaries are based on community standards, ontologies, and controlled vocabularies, such as those from the cancer Data Standards Registry and Repository (ref. 11), International Classification of Diseases (ref. 12), and HUPO Proteomics Standards Initiative mass spectrometry (PSI-MS; ref. 13). We also align with the CRDC Data Standards Service, which aims to define standard data elements across all CRDC Data Commons (14).
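The biospecimen hierarchy and the sample-to-data-file relationships described above can be sketched with a few Python data classes. The class and field names below are hypothetical simplifications chosen for illustration; they are not the entity or attribute names of the PDC data dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Aliquot:
    aliquot_id: str


@dataclass
class Sample:
    sample_id: str
    sample_type: str  # e.g. tumor or normal
    aliquots: List[Aliquot] = field(default_factory=list)


@dataclass
class Case:
    case_id: str
    samples: List[Sample] = field(default_factory=list)


@dataclass
class DataFile:
    file_name: str
    plex: str
    fraction: int
    # Multiplexing: one file carries signal from several aliquots,
    # keyed here by a labeling channel.
    channel_to_aliquot: Dict[str, str] = field(default_factory=dict)


def files_for_aliquot(files, aliquot_id):
    """Names of all files that contain signal from the given aliquot."""
    return sorted(
        f.file_name for f in files if aliquot_id in f.channel_to_aliquot.values()
    )


# Hypothetical example: two aliquots multiplexed into one plex that was
# fractionated into three raw files.
case = Case(
    "CASE-1",
    [
        Sample("S-1", "tumor", [Aliquot("AL-1")]),
        Sample("S-2", "normal", [Aliquot("AL-2")]),
    ],
)
channels = {"126": "AL-1", "127N": "AL-2"}
files = [
    DataFile(f"plex01_f{n:02d}.raw", "plex01", n, dict(channels)) for n in (1, 2, 3)
]
tumor_files = files_for_aliquot(files, "AL-1")
```

The example makes the many-to-many relationship visible: each aliquot maps to every fraction file of its plex, and each file maps back to every aliquot in that plex.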
15), the PDC harmonizes proteomic data from various programs such as the CPTAC, Children's Brain Tumor Network, International Cancer Proteogenome Consortium (ICPC), and Applied Proteogenomics OrganizationaL Learning and Outcomes (APOLLO; ref. 16), using standardized workflows and community-accepted ontologies. This process effectively eliminates the "data pipeline" variable, thus facilitating comparisons across datasets. The first step in the harmonization process assigns standard identifiers, performs data integrity checks, and ensures adherence to standards (community-accepted vocabulary and nomenclature for clinical attributes, peptides, proteins, protein sequence variants, and modifications, and open data formats for files) and to the PDC data model. The second step involves organizing the data to prevent duplication. Given that most cancer cohorts in PDC undergo multiple characterizations [such as global proteomics, posttranslational modifications (PTMs), metabolomics, and lipidomics], the data are organized to represent cases or subjects and their associated clinical data without duplication across these different data types. This approach not only prevents data redundancy but also simplifies data management and retrieval, enhancing the efficiency of research and enabling comprehensive multi-omics analyses. The final step in the harmonization is to process the submitted raw MS data files through a Common Data Analysis Pipeline (CDAP) to produce derived analysis results, which can be used to study the identification of proteins and PTMs. Currently, CDAP harmonization is limited to the proteomic data. PDC primarily receives data acquired through data-dependent acquisition (DDA; ref. 17) and data-independent acquisition (DIA; ref. 18) MS methods and employs different CDAPs to process them. Following analysis by either CDAP, the results (mzML spectral data files, Peptide Spectral Match (PSM) results, and summary reports for proteins and PTM sites) are made available on the PDC Data Portal, along with the original raw data and metadata.
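The second harmonization step, representing each case once across several data types, can be sketched as follows in Python. The input layout (`data_type`, `cases`, `case_id`, `clinical`) and the conflict check are assumptions made for this illustration and do not describe the internal PDC implementation.

```python
def organize_cases(submissions):
    """Merge per-study case lists so that each case appears exactly once.

    Each submission is a dict with a 'data_type' and a list of 'cases';
    each case has a 'case_id' and a 'clinical' dict (hypothetical layout).
    """
    cases = {}
    for study in submissions:
        for record in study["cases"]:
            entry = cases.setdefault(
                record["case_id"],
                {"clinical": record["clinical"], "data_types": []},
            )
            # Integrity check: the same case must not carry conflicting
            # clinical annotations in different submissions.
            if entry["clinical"] != record["clinical"]:
                raise ValueError(
                    f"conflicting clinical data for case {record['case_id']}"
                )
            if study["data_type"] not in entry["data_types"]:
                entry["data_types"].append(study["data_type"])
    return cases


# Hypothetical submissions for one cohort characterized two ways.
submissions = [
    {
        "data_type": "proteome",
        "cases": [
            {"case_id": "C1", "clinical": {"disease": "example"}},
            {"case_id": "C2", "clinical": {"disease": "example"}},
        ],
    },
    {
        "data_type": "phosphoproteome",
        "cases": [{"case_id": "C1", "clinical": {"disease": "example"}}],
    },
]
organized = organize_cases(submissions)
```

Keying everything on the case identifier means the clinical record is stored once and each additional characterization only adds a reference to it.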

FIGURE 1 Overview of NCI's PDC: harmonized data distribution and interoperability within NCI's CRDC.

.tmt.tsv: TMT workflow protein relative quantitation report
.peptides.tsv: identified peptide summary report
.phosphopeptide.tsv: labeled workflow phosphopeptide relative quantitation report
.phosphosite.tsv: labeled workflow phosphosite relative quantitation report
.glycopeptide.tsv: labeled workflow N-linked glycopeptide relative quantitation report
.glycosite.tsv: labeled workflow N-linked glycosite relative quantitation report
DIA analysis:
precursors_unnormalized.tsv: unnormalized precursor peak areas
precursors_normalized.tsv: median-normalized precursor peak areas
proteins_unnormalized.tsv: unnormalized protein abundances, calculated by taking the sum of every precursor in the protein
proteins_normalized.tsv: DirectLFQ-normalized protein abundances
sky.zip: the Skyline document used for quantification of chromatographic peaks
QC reports: Quality control metrics computed by the CDAP; the report consists of summary statistics derived from all MS/MS spectra from the raw spectral data files.
Supplementary data: Other metadata provided by data submitters, including descriptive protocols, clinical metadata, and other useful information. Data submitters may also provide processed outputs from the analysis pipelines used in their peer-reviewed publications.

Application programming interface
PDC offers GraphQL-based Application Programming Interfaces (APIs), which provide more flexibility than REST APIs to request specific data, reducing over-fetching and enabling more efficient data retrieval. Users can find Swagger-based documentation, a playground for trying the APIs, and sample Python notebooks that provide guidance on running API queries, visualizing data, and performing statistical analyses (Table

PDC and CRDC aim to simplify data reuse by facilitating access to complex datasets and analysis tools. Programs like CPTAC and APOLLO distribute multi-omics data across CRDC nodes: PDC for proteomic data, GDC for genomic data, and IDC for imaging data. PDC displays the cross-referencing to these resources for individual cases on the portal, allowing users to

FIGURE Cross-referencing to genomic and imaging resources for individual cases. Example of cross-referencing: A, on the Clinical tab of PDC's Explore page; B, on the Clinical tab and External References section of PDC study summary pages.
The data submission portal was developed using the Chorus open-source project (https://chorusproject.org/), an MS-based proteomic resource where researchers can upload raw data, organize them by the instrument make and model, create projects, and share with other members of their program. Substantial enhancements were made to facilitate the inclusion of various metadata, including the protocols, clinical and biospecimen information, experimental design, and file attributes, in adherence with the PDC data model and data dictionaries. The data submission portal serves as the workspace for data submitters, especially those who make multiple data submissions over time, to organize the studies into programs and projects. We built on an open-source project to save development time and promote the reuse of established resources.

TABLE 1
Available data types in the Proteomic Data Commons
.summary.tsv: protein identification summary report
.precursor_area.tsv: label-free workflow protein quantitation report for relative quantitation by precursor peak area integration
.spectral_count.tsv: label-free workflow protein quantitation report for relative quantitation by spectral counts
.itraq.tsv: iTRAQ workflow protein relative quantitation report

TABLE 2
Documentation resources in the PDC