ProteomicsDB: toward a FAIR open-source resource for life-science research

Abstract
ProteomicsDB (https://www.ProteomicsDB.org) is a multi-omics and multi-organism resource for life science research. In this update, we present our efforts to continuously develop and expand ProteomicsDB. The major focus over the last two years was improving the findability, accessibility, interoperability and reusability (FAIR) of the data as well as its implementation. For this purpose, we first release a new application programming interface (API) that provides systematic access to essentially all data in ProteomicsDB. Second, we release a new open-source user interface (UI) and show the advantages the scientific community gains from such software. With the new interface, two new visualizations were added: a viewer for protein primary, secondary and tertiary structure and an updated spectrum viewer. Furthermore, we integrated ProteomicsDB with our deep neural network Prosit, which can predict the fragmentation characteristics and retention time of peptides. The result is an automatic processing pipeline that can be used to reevaluate database search engine results stored in ProteomicsDB. In addition, we extended the data content with experiments investigating different human biology as well as a newly supported organism.


INTRODUCTION
ProteomicsDB (https://www.ProteomicsDB.org) has developed into a multi-omics and multi-organism resource for life science research (1). It is built upon the in-memory database technology HANA (2), enabling fast access to stored data and thus offering real-time data analytics capabilities. ProteomicsDB was originally developed to investigate large quantities of human quantitative mass spectrometry-based proteomics data, highlighted by one of the first drafts of the human proteome (3,4). However, over the past years it was extended to include additional organisms, including Mus musculus and Arabidopsis thaliana (5), as well as additional omics types, such as transcriptomics and phenomics data (1,4). Because of this, ProteomicsDB has become a rich and valuable resource for life science research and extends beyond the scope of proteomics experiments. This is reflected by the external resources integrating with ProteomicsDB, such as GeneCards (6), UniProt (7), OmniPathDB (8) and Gene Information eXtension (GIX) (9). Today, we record on average ∼500 unique visitors per day.
A unique characteristic of ProteomicsDB is its ability to integrate large amounts of diverse data.
For example, while Expression Atlas (10) provides differential and baseline proteomics and transcriptomics data for a diverse set of organisms that can be explored online, the analysis is limited to the investigation of single experiments. In ProteomicsDB, the expression information across hundreds or thousands of experiments can be investigated simultaneously. In MaxQB (11), researchers are able to retrieve data from individual proteins similar to ProteomicsDB. However, the stored data are limited to proteomics with a limited number of distinct experiments. For example, the expression information of epidermal growth factor receptor (EGFR) in MaxQB covers 11 cell lines while ProteomicsDB provides information for 41 tissues and body-fluids as well as for 60 cell lines. For 52 of these, ProteomicsDB also provides cell viability information.
D1542 Nucleic Acids Research, 2022, Vol. 50, Database issue
Large data stewards, like ProteomicsDB, have the obligation to provide access to their data content in a way that also enables other researchers to reproduce, reanalyze and integrate the data. The specific requirements and principles behind this concern the Findability, Accessibility, Interoperability, and Reusability (FAIR) of (research) data (12). Following this movement, additional work expanded these principles to account for (research) software as well (13). The need for this separation becomes clear when considering one concrete principle. The reusability aspect of data is met when rich descriptions of the data are made available in a common data format. For software, this principle is additionally linked to the maintainability of the codebase, which includes the availability of appropriate documentation of the source code (13). The FAIR principles are at the very core of open science and are essential for the scientific community to use generated data effectively. As such, they were a major focus guiding the development of ProteomicsDB over the last two years.
In this update, we discuss the developments of ProteomicsDB over the last two years, and specifically highlight our progress in turning ProteomicsDB into a FAIR and open-source resource for life science research. For that purpose, we designed and implemented a reference architecture for ProteomicsDB (14) to enable fast development of new services and keep these services maintainable, manageable and extendable in the future. Based on this, we created a new API that gives users access to essentially all data stored in ProteomicsDB, achieving a major step toward enabling FAIR data access. We also release an open-source re-implementation of the user interface (UI) that not only turns the frontend into a resource that external developers can reuse and expand but also brings ProteomicsDB in accordance with modern web standards. In light of this, a new visualization was added that shows the primary, secondary and tertiary structure of proteins. In addition, we imported new data into ProteomicsDB, including data from a new organism, rice (Oryza sativa ssp. japonica), and we created a pipeline to improve the quality of the proteomics data stored within ProteomicsDB by using Prosit, a deep neural network that can predict various properties of peptides (15,16).

Full access to data stored in ProteomicsDB via new API
ProteomicsDB has offered access to its data in the form of an application programming interface (API) since its inception. However, the available APIs limited access to 10 predefined views, all centered on the proteomics data. Even then, users had no access to a large number of internal tables storing information on, for example, the controlled vocabularies used, nor to the omics data types added in the past years. For this reason, we developed a new central API, version two (APIv2.0), that provides access to essentially all data currently stored in ProteomicsDB (Figure 1). During its development, we followed the guidelines and recommendations of the FAIR principles (12), with a focus on making the API of ProteomicsDB accessible and usable for bioinformatics and non-bioinformatics researchers as well as developers. The new version incorporates the functionality of all previously offered APIs, turning it into the central (programmatic) access point to the data stored in ProteomicsDB.
An important aspect of offering FAIR data access is to use established standards. For this reason, we decided to use the OData Version 2.0 Protocol (https://www.odata.org/documentation/odata-version-2-0/overview/). OData is used for creating HTTP-based data services that can be queried by web clients using standard HTTP messages and respond in a standardized structure. For each OData service, metadata concerning the service is automatically created. This ensures the compliance of all already created and all future API endpoints regarding their findability, accessibility, interoperability and reusability. Furthermore, OData offers a large set of automatically generated functionalities, such as filtering and data formatting [in JSON (17) and XML (18)]. These features are consequently all available in our APIv2.0.
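To illustrate the standardized response structure, the following minimal Python sketch unpacks an OData Version 2.0 JSON payload. In that format, a collection is wrapped as `d.results` and each entry carries a `__metadata` object. The sample payload and its field names (`SAMPLE_ID`, `NAME`) are invented for illustration and are not taken from ProteomicsDB.

```python
import json

# Hypothetical OData v2 JSON payload; the field names are assumptions.
SAMPLE_PAYLOAD = """
{"d": {"results": [
  {"__metadata": {"type": "Sample"}, "SAMPLE_ID": 1, "NAME": "liver"},
  {"__metadata": {"type": "Sample"}, "SAMPLE_ID": 2, "NAME": "brain"}
]}}
"""


def extract_entries(payload: str) -> list:
    """Return the entries of an OData v2 JSON response without their metadata."""
    body = json.loads(payload)["d"]
    # Collections sit under "results"; a single entity is the "d" object itself.
    entries = body["results"] if "results" in body else [body]
    return [
        {key: value for key, value in entry.items() if key != "__metadata"}
        for entry in entries
    ]


rows = extract_entries(SAMPLE_PAYLOAD)
```

Because every endpoint follows the same envelope, a single helper of this kind should be reusable across all entities of the service.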
For easier navigation, we separated the entire data model of ProteomicsDB into 19 topic clusters. A topic cluster groups multiple entities (e.g. samples and experiments) that contain information about a similar content type (e.g. the repository or transcriptomics data). For example, the repository of ProteomicsDB is such a topic cluster (Figure 2), in which the data on and relations between projects, experiments, samples, files, measurements and supplementary files can be queried. In total, the APIv2.0 allows 93 entities to be queried. To query an entity, the URL only contains the requested entity, e.g. /api_v2/api.xsodata/Sample. This query will return the descriptions and metadata of all available samples in ProteomicsDB.
A central objective of the APIv2.0 was that users can navigate from one entity to another. This was realized with 'navigation properties', which allow users to traverse easily between entities in multiple directions. For example, from the list of samples users can navigate to a list of all files that are connected to a given sample or to the respective experiment of that sample (Figure 2). This can be achieved by querying for '/api_v2/api.xsodata/Sample(ID)/File' or '/api_v2/api.xsodata/Sample(ID)/Experiment', respectively. This feature is available for all entities within a topic cluster and, where possible, across topic clusters. With this step, we simplify access and allow users to systematically query for data originally separated into multiple APIs. In accordance with the FAIR principles, all entities in ProteomicsDB come with a Global Unique Identifier (G_UID) that follows the format: PRDB_UID:PRDB:<EntityName>:<LocalIDOfEntity>.
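The URL patterns and the identifier format described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the service root `SERVICE_ROOT` is assumed and should be checked against the online API documentation, the underscores in `api_v2` and `PRDB_UID` are assumed, entity keys are assumed to be integers, and the helper names `entity_url`, `make_guid` and `parse_guid` are hypothetical.

```python
from urllib.parse import quote

# Assumed service root; verify against the online API documentation.
SERVICE_ROOT = "https://www.proteomicsdb.org/proteomicsdb/logic/api_v2/api.xsodata"
GUID_PREFIX = "PRDB_UID:PRDB"  # assumed spelling of the identifier prefix


def entity_url(entity, key=None, navigation=None):
    """Build the URL of an entity set, a single entity, or a navigation target."""
    url = f"{SERVICE_ROOT}/{quote(entity)}"
    if key is not None:
        url += f"({int(key)})"  # integer keys are an assumption
    if navigation is not None:
        url += f"/{quote(navigation)}"
    return url


def make_guid(entity, local_id):
    """Compose an identifier of the form PRDB_UID:PRDB:<EntityName>:<LocalID>."""
    return f"{GUID_PREFIX}:{entity}:{local_id}"


def parse_guid(guid):
    """Split an identifier into (entity name, local ID); raise if malformed."""
    prefix, sep, rest = guid.partition(GUID_PREFIX + ":")
    entity, colon, local_id = rest.partition(":")
    if prefix or not sep or not colon or not entity or not local_id:
        raise ValueError(f"not a ProteomicsDB-style identifier: {guid!r}")
    return entity, local_id


all_samples = entity_url("Sample")
files_of_sample = entity_url("Sample", key=42, navigation="File")
```

Here `files_of_sample` corresponds to the pattern `Sample(ID)/File` from the text, with 42 standing in for an arbitrary sample ID.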
A detailed description of the APIv2.0 is available online (https://www.proteomicsdb.org/vue/apiv2/). Here, we list all available entities, their attributes (columns) and possible navigation properties to other entities. Additionally, each navigation property and entity listed also includes an example request. In order to find relevant entities and navigation properties, we implemented a search functionality that allows searching for any content listed in the API documentation (i.e. entities, attributes, navigation properties and examples).
Figure 1. The architecture of ProteomicsDB. The data content and data layer of ProteomicsDB are accessible via three application programming interfaces (APIs). The API4UI is used by the frontend and contains predefined requests to the data in ProteomicsDB for the purpose of data visualization. The novel vue-based visualization layer of ProteomicsDB (top left) is separated into three levels. The proteomicsdb-components package is agnostic toward ProteomicsDB and thus usable on any website. The package proteomicsdb-wrappers connects the components with ProteomicsDB and can be re-used on any website as well. The package proteomicsdb-view contains the entire vue-based frontend of ProteomicsDB. The APIv1.1 is used by external resources (top right) and will remain publicly available. The new APIv2.0 provides access to virtually any datasets stored in ProteomicsDB.
We are continuously working on extending ProteomicsDB and, due to this, the APIv2.0 will also be subject to changes, such as the addition of new navigation properties, entities and columns. The newly developed reference architecture for ProteomicsDB (14) enables versioning. Because of that, currently available endpoints will remain available even in the rare event of modifications to the internal representation of the data. When using the APIv2.0, we recommend requesting only the necessary data, e.g. by using the filtering options of OData, to reduce the overall response time, as the largest table of ProteomicsDB exceeds 40 billion entries. The new API is a substantial improvement over the status quo and will enable scientists to benefit from the wealth of data stored in ProteomicsDB and to integrate data from ProteomicsDB more easily into their applications and databases.
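As a hedged illustration of this recommendation, the sketch below assembles a request that restricts rows, columns and result size with the standard OData Version 2.0 query options `$filter`, `$select`, `$top` and `$format`. The service root and the attribute names (`SAMPLE_ID`, `NAME`) are assumptions for illustration; the actual attributes of each entity are listed in the online API documentation.

```python
from urllib.parse import quote, urlencode

# Assumed service root; verify against the online API documentation.
SERVICE_ROOT = "https://www.proteomicsdb.org/proteomicsdb/logic/api_v2/api.xsodata"


def build_query(entity, filter_expr=None, select=None, top=None, fmt="json"):
    """Return a URL that asks the service only for the data that is needed."""
    params = {}
    if filter_expr:
        params["$filter"] = filter_expr        # restrict the rows
    if select:
        params["$select"] = ",".join(select)   # restrict the columns
    if top is not None:
        params["$top"] = str(top)              # cap the number of entries
    params["$format"] = fmt                    # "json" instead of the XML default
    # Keep "$", "," and "'" readable; spaces are percent-encoded as %20.
    query = urlencode(params, quote_via=quote, safe="$,'")
    return f"{SERVICE_ROOT}/{entity}?{query}"


url = build_query(
    "Sample",
    filter_expr="SAMPLE_ID gt 100",  # hypothetical attribute name
    select=["SAMPLE_ID", "NAME"],    # hypothetical attribute names
    top=10,
)
```

The resulting URL can be passed to any HTTP client. Filtering on the server side in this way avoids transferring rows that would be discarded locally.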

Open-source ProteomicsDB frontend via reimplementation in Vue.js
The current user interface (UI) of ProteomicsDB was built on an SAP-specific framework, termed SAPUI5. However, even its open-source variant, OpenUI5, is infrequently used in research. Due to this, developers in the field of life science research are unlikely to integrate or reuse the applications and visualizations developed for ProteomicsDB. Hence, open-sourcing the current UI is of little value to the scientific community. In accordance with our goal of turning ProteomicsDB into a FAIR resource, we set out to re-implement the UI of ProteomicsDB focusing on modularity, reusability and flexibility. The current version of the re-implementation (https://www.proteomicsdb.org/vue) covers all functionality required to browse and interact with the results stored for a single protein of interest as well as two analytics.
Figure 2. [...] Each table is available in the API as a separate entity (square boxes). To navigate between entities within (dashed black arrows) or across (solid black arrows) topic clusters, corresponding navigation properties were defined that allow the traversal of the available data. A detailed documentation of the API is available online under https://www.proteomicsdb.org/vue/apiv2/.
We selected Vue.js (https://vuejs.org/) in combination with the Vuetify (https://vuetifyjs.com/en/) package as the new frontend framework. This decision was made for two reasons. First, it is intuitive and well documented, which is important for creating a maintainable and reusable UI. Especially (external) developers interested in generating a new visualization will benefit from this. Second, the component system (modularization) of Vue.js allows easy encapsulation of functionality and, subsequently, reuse of visualizations. In line with our goal to improve the FAIRness of ProteomicsDB, we decided to exploit this core feature of Vue.js and separate our new interface into three functional levels (Figure 1, top left). The package proteomicsdb-components (https://github.com/wilhelm-lab/proteomicsdb-components) provides the base functionality for the different visualizations used in ProteomicsDB. These components are agnostic to ProteomicsDB and thus can be reused on any website without specific dependencies and can be connected to any other source of data. The package proteomicsdb-wrappers (https://github.com/wilhelm-lab/proteomicsdb-wrappers) provides wrappers for these visualizations that request the data from ProteomicsDB. These wrappers can also be used on any website but require a connection to ProteomicsDB. Last, these visualizations are combined into views in the package proteomicsdb-views (https://github.com/wilhelm-lab/proteomicsdb-views), which can be thought of as subpages in ProteomicsDB.
All of these levels are publicly available on GitHub as separate repositories. We expect that this will further improve the findability and accessibility, but particularly the reusability, of the code base of ProteomicsDB. Each of the three repositories is identified by an individual Digital Object Identifier (DOI), while each version can be uniquely identified by the associated git commit hash.
With the switch to Vue.js and the reimplementation necessary for that, we also decided to redesign the layout of ProteomicsDB to provide a more intuitive and modern-looking experience (Figure 3). The organism selection previously located on the left of the screen has moved to a drop-down menu at the top left, next to the ProteomicsDB logo. The main tabs that were previously at the top of the screen can now be accessed on the right side of the screen after clicking the three stacked horizontal bars (hamburger button) in the top right of the screen. Otherwise, they are hidden to dedicate a larger proportion of the screen to the current view. At the top center of the screen, a new universal search field serves as a direct entry point to all aspects of ProteomicsDB.
After searching for a gene of interest and selecting a specific protein/isoform, the UI changes and a second menu appears on the left. This menu shows the different navigation options to investigate, for example, the observed peptides or expression pattern. The blue bubbles indicate whether and how much data are available in this view, for example, 137 distinct peptides identified for protein EGFR (Figure 3). The views available here are largely identical to the old UI, but some slight adjustments were made. For example, the biochemical assay tab was split into three separate views that show the available binding data for different inhibitors, melting behavior and turnover data.
In addition to the redesign of the UI, two new visualizations were created for ProteomicsDB. The first is the Feature Viewer (Figure 4), a custom adjustment (https://github.com/wilhelm-lab/protvista-proteomicsdb) of protvista-uniprot (https://github.com/ebi-webcomponents/protvista-uniprot) that depicts primary (e.g. sequence coverage and conservation) and secondary (e.g. domains, solvent accessibility and disordered regions) structure information of the selected protein. The properties shown originate from internal data or external resources (19–22) and are shown as separate tracks. Each track can be expanded to reveal a more detailed view (Figure 4, secondary structure), while a specific region of one attribute can be selected to reveal additional information (Figure 4, gray popup on the domain FU 496-547). In addition, available 3D structures are retrieved from PDB (22) and listed. A single structure can be selected (Figure 4).
The second example of a vastly improved visualization is the spectrum viewer (Figure 5), a modified version of the Universal Spectrum Explorer (23). It is accessible by selecting a specific peptide of interest in either the Peptide MS/MS or the Reference Peptides view, which show a table with the observed or synthetic/predicted reference peptides for the selected protein. As in the old version, every peptide spectrum match (PSM) stored in ProteomicsDB can be investigated here. Selecting a PSM (Figure 5, top left) fetches the associated spectrum. By default, a corresponding predicted reference spectrum is generated in real time by Prosit and can be used to manually verify the correctness of the identification. In addition, reference spectra stored in ProteomicsDB from e.g. ProteomeTools (24) can be selected.
The reimplementation of the UI in Vue.js not only enables external developers to reuse views and visualizations developed for ProteomicsDB but also shows that external views can be reused in ProteomicsDB. The availability of the source code on GitHub also creates a communication channel through which users and developers can report bugs and request new features, all supporting the FAIRification of ProteomicsDB.

Increasing peptide and protein coverage by rescoring of FAIR data
Our recently described deep neural network Prosit was trained to predict the fragment intensities and retention times of peptides (15). Such predictions can be used to improve the separation between correct and incorrect matches of database search engine results (25). To achieve this, theoretical spectra of the proposed peptide sequences are predicted using Prosit and compared to the experimentally observed spectrum. Based on this, a variety of intensity-based scores are calculated. This rescoring process supports the notion that published datasets often contain more information than what was initially discovered (26) and that FAIR datasets are a rich resource for novel findings. Additionally, it can be used to align and compare the results obtained from different database search engines (16).
Considering the large amounts of data made available via ProteomicsDB, we decided to integrate the rescoring workflow directly into ProteomicsDB to enable the automatic re-processing of any FAIR dataset. The workflow (Figure 6A) can be triggered on datasets that have an associated ProteomeXchange (27) identifier. The associated raw mass spectrometry files are then automatically downloaded from PRIDE (28). Together with the reconstructed database search engine results from ProteomicsDB, a regular rescoring by Prosit is triggered. The Percolator results are then imported into ProteomicsDB again. This does not overwrite any data of the original search results, and during false discovery rate (FDR) estimation either the original search engine scores or the intensity-based scores from Prosit are used.
As a proof of principle, we rescored 30 tissues of the data published by Wang et al. (29), in which the proteomes and transcriptomes of healthy human tissues were characterized. When analyzing each tissue separately, on average 8289 (±1126 standard deviation, SD) proteins were identified without rescoring (Figure 6B). The rescoring approach identified on average 8788 (±1088 SD) proteins across the different tissues. This is equal to an average relative increase of 6%. The largest benefit we observed was for bone marrow, with a relative increase of 13%. The data for the small intestine benefited least from the rescoring but still showed an increase in the number of identified proteins of ∼4%. The effect at the peptide level was even more pronounced. The number of identified peptides increased on average by 16%, from 71 631 (±22 216 SD) to 82 165 (±22 209 SD). The tissues that benefited most and least at the peptide level were bone marrow and brain, with an increase of 40% and 7%, respectively. The large effects seen in bone marrow at the peptide and protein level are most likely due to the overall lower number of identifications in this tissue. The biggest relative effect was observed for tissues with the smallest number of identified peptides without rescoring. This is consistent with previous observations that the rescoring is most beneficial when the identification rate is unexpectedly low, likely due to a strong overlap of target and decoy matches (15).
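The relative increases quoted above can be recomputed from the reported averages with a few lines of Python. This sketch uses only the numbers stated in the text. Note that the increase of the averages need not equal the average of the per-tissue increases, which is presumably why the peptide-level value computed here (about 14.7%) differs from the reported tissue-wise average of 16%.

```python
def relative_increase(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Average identifications per tissue, without and with rescoring (from the text).
protein_gain = relative_increase(8289, 8788)
peptide_gain = relative_increase(71631, 82165)

print(f"proteins: +{protein_gain:.1f}%")
print(f"peptides: +{peptide_gain:.1f}%")
```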
In order to safely allow the combination of rescored and non-rescored data, we modified the FDR estimation procedure implemented in ProteomicsDB. As described earlier (30), we utilize Q-scores (−log10 of the q-values) in order to combine results from different result sets. Figure 6C shows the Q-score distribution of target and decoy proteins. Here, the mouse data were chosen because of their high proportion of rescored data. The high degree of overlap between the number of estimated false positives (decoys) and likely incorrect targets in the low-scoring region suggests that no bias is visible for proteins supported by either rescored or non-rescored data. This is further supported by the estimated distribution of true positives (target-decoy), which does not show any bimodality, suggesting that the decoy distribution accurately resembles the distribution of false matches in the target database.
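To make the Q-score concrete, the sketch below derives q-values from a generic target-decoy competition and converts them to Q-scores. It is not the procedure implemented in ProteomicsDB (which is described in (30)); the decoy/target ratio as FDR estimate and the `floor` that avoids taking the logarithm of zero are assumptions made for illustration.

```python
import math


def q_values(scores, is_decoy):
    """Estimate q-values by target-decoy counting; higher scores are better."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdrs = []
    targets = decoys = 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        # FDR estimate when accepting everything down to this score.
        fdrs.append(decoys / max(targets, 1))
    # The q-value is the lowest FDR at which a match would still be accepted.
    result = [0.0] * len(scores)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        result[order[rank]] = min(running_min, 1.0)
    return result


def q_score(q_value, floor=1e-6):
    """Q-score = -log10(q-value); `floor` is an assumed lower bound for q = 0."""
    return -math.log10(max(q_value, floor))


scores = [10.0, 9.0, 8.0, 7.0, 6.0]
decoy_flags = [False, False, True, False, True]
qs = q_values(scores, decoy_flags)
```

On this scale, a q-value of 0.01 corresponds to a Q-score of 2, so results from different result sets can be compared on a common axis regardless of the original score type.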
Figure 5. Spectrum viewer. The spectrum viewer (bottom) visualizes the selected peptide spectrum match from the table in the top left. The configuration element on the top right can be used for, but is not limited to, retrieving reference spectra depicted in the mirror view at the bottom. Reference spectra can be generated in real time by Prosit or requested from ProteomeTools. In between the experimental and reference spectrum, the alignment error between an observed and reference peak is shown in parts-per-million (ppm). The spectral similarity between the experimental and reference spectrum is measured by calculating the Pearson correlation (PCC) and normalized spectral contrast angle (SA). The measures inside the brackets show the result of this comparison when taking either the peaks of the experimental or reference spectrum into account, whereas the values outside the brackets show the measures calculated taking all peaks from both spectra into account.
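The two similarity measures named in the Figure 5 caption can be sketched in plain Python as follows. The sketch assumes that both spectra have already been aligned into intensity vectors of equal length, with 0 for a peak missing in one of them, and it uses the common definition of the normalized spectral contrast angle, SA = 1 − 2·arccos(cosine similarity)/π. The peak matching performed by the viewer itself is not reproduced here, and the example intensities are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long intensity vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spectral_angle(x, y):
    """Normalized spectral contrast angle: 1 for identical, 0 for orthogonal."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


# Invented, already aligned intensities; 0.0 marks a peak missing in one spectrum.
experimental = [0.9, 0.1, 0.0, 0.5]
reference = [1.0, 0.2, 0.1, 0.4]
pcc = pearson(experimental, reference)
sa = spectral_angle(experimental, reference)
```

Restricting the vectors to the peaks of only the experimental or only the reference spectrum before calling these functions would correspond to the bracketed values described in the caption.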
The systematic rescoring of datasets in ProteomicsDB is only possible due to resources such as PRIDE which enable the findability, accessibility, interoperability and reusability of raw mass spectrometry files. With the full integration of the rescoring approach into ProteomicsDB, the number of peptides and confidence in their identification can be increased. With the ever growing amount of data available in ProteomicsDB, accurately assessing the confidence of peptide spectrum matches will remain a challenge which will require regular checks to be able to assure high overall data quality.

Increasing the findability of aggregated data by ProteomicsDB
ProteomicsDB is the central point of access to aggregated information (e.g. protein expression) for a majority of its stored datasets and thereby fosters their FAIRness. Over the last two years, many additional datasets were added to ProteomicsDB (Figure 7). We imported proteomics data from 32 projects investigating different human biology (29, …) that represent data on 40 new tissues and cell lines. In total, over 57 million experimental spectra and >500 thousand quantitative data points were added to ProteomicsDB. Considering the large amount of data previously available in ProteomicsDB, the effect on the number of identified proteins and genes is nonetheless substantial, raising the confidence of 1281 protein isoforms and 878 genes to meet the <1% FDR criterion. The FAIRness of datasets reporting aggregated data beyond protein expression values (e.g. melting curves or dose-response curves) benefits especially from ProteomicsDB because even fewer resources exist for those. Most often, such data are only available in the supplement of the original publication, hampering FAIRness. Recently, we added protein-drug binding data covering a new class of proteins, histone deacetylases (HDACs). The inhibition of HDACs has shown promise as a therapeutic option in oncology and other conditions such as Duchenne Muscular Dystrophy (66). We imported data for 53 HDAC inhibitors covering 14 target proteins, totaling 735 HDAC dose-response curves (67).
Most notably, we extended ProteomicsDB to support the storage and visualization of data for a new organism, Oryza sativa ssp. japonica (rice) (Figure 7A). All functionalities of ProteomicsDB readily transfer to new organisms. For example, the visualization of expression values on a 'bodymap' (Figure 7A) only requires the addition of a new organism visualization, while the data retrieval, mapping and coloring of tissues are implemented generically. The imported data cover 28 rice tissues. In total, >4 million experimental spectra were imported, resulting in the confident identification of close to 170 thousand distinct peptides, of which >150 thousand are unique at the gene level. Due to the imported data, 2621 of the 4051 annotated rice genes are confidently identified, resulting in a coverage of 64%. For protein isoforms, 13 742 of the 43 671 annotated were identified, resulting in an isoform coverage of 31%.

FUTURE DIRECTIONS
The updates introduced over the last two years provide a solid foundation for turning ProteomicsDB into a FAIR resource for life science research. There are three specific objectives we aimed to support by this. First, foster data re-use for wet- and dry-lab researchers and allow them to utilize and benefit from the wealth of data available. Second, share our efforts in developing modern and easy-to-use web applications. Third, switch the development of ProteomicsDB to a community-driven effort. For this purpose, we are also currently developing a community portal within ProteomicsDB to allow users to share and discuss ideas about new visualizations and features. At the time of writing, a direct line of communication between users and the current developers was established via GitHub, where users can report discovered bugs or request new features. Ultimately, the integration of Prosit into ProteomicsDB enables the rescoring of all data stored in ProteomicsDB. On individual datasets, we observed an average increase in the number of identified peptides of 16% and of proteins of 6%. When performed on all data, this may increase the coverage of ProteomicsDB substantially and increase the quantitative precision by increasing the number of observed peptides used to quantify each protein. In addition, this allows us to combine multiple database search engine results across and within datasets and will eventually enable us to integrate the results of novel search engines.
A strong focus of the next years will be on the finalization of the new interface, as well as the integration of substantially more data. In particular, the extension to support the storage, visualization and integration of data from experiments that investigated post-translational modifications will be of high priority. For this, new views and visualizations are required, which can be developed much faster thanks to the migration to the new reference architecture and Vue.js. We expect that the publicly available API and open-source implementation of the UI will facilitate the development of novel applications and analytics. We further envisage that ProteomicsDB can be made available as private instances for research institutions, consortia or individual labs.