HBFP: a new repository for human body fluid proteome

Abstract The body fluid proteome has been intensively studied as a primary source for disease biomarker discovery. Using advanced proteomics technologies, early research success has resulted in a growing number of proteins detected in different body fluids, many of which are promising biomarkers. However, despite a handful of small-scale and specific data resources, there has been little effort to compile published body fluid proteins into a centralized and sustainable repository that provides users with systematic analytic tools. In this study, we developed a new database of the human body fluid proteome (HBFP) that focuses on experimentally validated proteomes in 17 types of human body fluids. The current database archives 11 827 unique proteins reported by 164 scientific publications since 2001, with a maximal false discovery rate of 0.01 on both the peptide and protein levels, and enables users to query, analyze and download protein entries with respect to each body fluid. Three unique features of this new system are as follows: (i) the protein annotation page includes detailed abundance information based on relative qualitative measures of peptides reported in the original references, (ii) a new score is calculated for each reported protein to indicate the discovery confidence and (iii) HBFP catalogs 7354 proteins with at least two non-nested uniquely mapping peptides of nine amino acids according to the Human Proteome Project Data Interpretation Guidelines, while the remaining 4473 proteins have more than two unique peptides without given sequence information. We anticipate that this new HBFP database, as an important resource for the human protein secretome, can be a powerful tool that facilitates research in clinical proteomics and biomarker discovery. Database URL: https://bmbl.bmi.osumc.edu/HBFP/


Background
Human body fluids are thought to be rich sources of disease-associated proteins that are secreted or leaked from pathological tissues across the body, and many of these fluids can be obtained through non-invasive procedures (1,2). Driven by these factors, research interest in biomarker discovery through body fluid proteomes has soared over the past few decades. Empowered by innovative high-throughput technologies, modern proteomic studies have identified a large number of proteins in various body fluids such as plasma, serum, saliva and urine (3).
With great effort by a few large consortia, several community-based proteomic databases have been developed in the past decades. For example, in 2002, the international Human Proteome Organization initiated the Human Plasma Proteome Project and reported human plasma and serum protein constituents in its online databases (4). A similar database, the Plasma Proteome Database, archived more than 10 000 proteins detected in human blood (5). Additionally, the Proteomics Identifications database (6) and Human Plasma PeptideAtlas (7) report a total of 3509 high-confidence plasma proteins. More recently, the extracellular vesicles community has also reported new proteins identified in exosomes from multiple sources, including blood and breast milk, e.g. in ExoCarta (8). In addition, the global Human Proteome Project (HPP) has issued a set of mass spectrometry (MS) data interpretation guidelines for the broader research community (9).
Our team recently conducted a systematic assessment of the human proteome identified using quantitative proteomics tools such as MS and computational predictive models, as documented in a recent review article (10). To expand this effort, we developed a new human body fluid proteome (HBFP) database that organizes 11 827 unique proteins reported in 164 scientific articles since 2001, with a maximal false discovery rate (FDR) of 0.01 on both the peptide and protein levels. To date, this database stores information about proteins from 17 types of body fluids: plasma/serum, saliva, urine, cerebrospinal fluid (CSF), seminal fluid (SF), amniotic fluid, tear fluid, bronchoalveolar lavage fluid (BALF), milk, synovial fluid, nipple aspirate fluid, cervical-vaginal fluid, pleural effusion, sputum, exhaled breath condensate, pancreatic juice and sweat. For each protein entry, a description of protein secretion information, literature source, abundance, confidence and functional annotation is provided. The database system also gives users easy access to data visualization, download and functional analysis based on Gene Ontology (GO) and pathways.

Protein entries
We manually collected proteins reported in 17 types of body fluids by carefully reviewing 164 scientific references published since 2001. The references were identified through a PubMed search and report an FDR ≤1% on both the peptide and protein levels.
In the HBFP database, each protein is assigned a unique identifier in the form of a UniProtKB/Swiss-Prot accession (UniProt release 2020_06) (11). Since the referenced studies used a mix of identifier types, we first used the conversion tools at BioDBnet (https://biodbnet-abcc.ncifcrf.gov/) (12) and UniProt (https://www.UniProt.org/) to convert the different identifiers to UniProt accession numbers. The common identifiers involved in this study include the International Protein Index ID [hosted at the European Bioinformatics Institute (EBI); closed in 2011], GI number (from the GenBank database), RefSeq protein accession (from the RefSeq database), gene name/symbol (from the NCBI Gene database) and UniProt protein/entry name (from the UniProt database). The ID conversion process is shown in Figure 1. During the conversion, poorly curated proteins with ambiguous identifiers were eliminated. For example, many International Protein Index IDs link to unclearly described instances that cannot be mapped to a UniProt entry; these were excluded.
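The convert-then-filter step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `convert_ids`, the `(source_id, accession)` row format and the rule that an identifier is "ambiguous" when it maps to zero or to more than one accession are all hypothetical, and the mapping rows stand in for a table exported from a conversion service.

```python
from collections import defaultdict


def convert_ids(source_ids, mapping_rows):
    """Map heterogeneous protein IDs to UniProt accessions.

    mapping_rows is an iterable of (source_id, uniprot_accession) pairs,
    a hypothetical export format from an ID-conversion service.
    Returns (converted, dropped): a dict of unambiguous mappings and a
    list of IDs that were unmapped or mapped to several accessions.
    """
    targets = defaultdict(set)
    for source_id, accession in mapping_rows:
        if accession:  # ignore rows whose target accession is empty
            targets[source_id].add(accession)

    converted, dropped = {}, []
    for source_id in source_ids:
        hits = targets.get(source_id, set())
        if len(hits) == 1:
            converted[source_id] = next(iter(hits))
        else:
            dropped.append(source_id)  # assumption: treat as ambiguous
    return converted, dropped


# Toy input: "ALB" -> P02768 (human serum albumin); the X* IDs are made up.
rows = [("ALB", "P02768"), ("X1", "A00001"), ("X1", "A00002"), ("X2", "")]
converted, dropped = convert_ids(["ALB", "X1", "X2", "X3"], rows)
print(converted)
print(dropped)
```

Returning the dropped identifiers alongside the converted ones keeps the exclusion step auditable, which matters when entries must be traceable to the original literature.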

Database utilities
The interface of the HBFP database is built with PHP, and the database system is based on MySQL. The main contents of the current database are the query and browse pages, described as follows.
Database, Vol. 2021, Article ID baab065

Querying page
As one of the most important functions, the querying page allows users to search for body fluid proteins based on different types of input, including protein ID, gene name, and protein or gene sequence. When given a FASTA input, BLASTp or BLASTn is used to map the input sequence to the best-match protein entry. The top hit (the highest bit score) from the BLAST search is considered the best match of the query. Figure 2 illustrates the workflow and content of the querying page. The annotation of each protein contains the following information:
• Protein ID/name/entry name
• Gene name
• Associated body fluid type along with indicated discovery confidence (explained in the next section)
• References and protein abundance information where the protein is reported
• External links to public databases including UniProt, PeptideAtlas, NeXtProt (13) and MassIVE (14)
• Functional annotation based on the KEGG pathway (15) and GO (16)
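The best-match rule for sequence queries (keep the hit with the highest bit score) can be illustrated as follows. This is a minimal sketch that assumes BLAST results are available in the default 12-column tabular format of BLAST+ (`-outfmt 6`); how HBFP actually invokes and parses BLAST is not described here, and the function name `best_hits` and the sample rows are hypothetical.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text):
    """Return {query_id: (subject_id, bitscore)} for the top-scoring hit."""
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(FIELDS, row))
        score = float(rec["bitscore"])
        query = rec["qseqid"]
        # Keep the first hit seen, then replace it only on a higher score.
        if query not in best or score > best[query][1]:
            best[query] = (rec["sseqid"], score)
    return best


# Toy rows with invented alignment statistics.
sample = "\n".join("\t".join(r) for r in [
    ["q1", "sp|P02787|TRFE_HUMAN", "41.0", "180", "90", "4",
     "1", "180", "30", "205", "2e-30", "120"],
    ["q1", "sp|P02768|ALBU_HUMAN", "98.5", "200", "3", "0",
     "1", "200", "25", "224", "1e-120", "410"],
])
print(best_hits(sample))
```

Ties are resolved in favor of the first hit encountered, which is a design choice of this sketch rather than something stated in the text.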

Browsing page
This page provides an overview list of proteins associated with 17 types of body fluids and links to view and download selected proteins.

Data statistics
When determining the inclusion of reported proteins, we applied the following criteria for the credibility of the MS evidence. For papers that reported peptide sequence details, we remapped all peptide sequences to neXtProt (release 2021-02-15) using the neXtProt peptide uniqueness checker to remove unreliable matches (17). Specifically, we applied guideline #15 of HPP Guidelines 2.1 (9) and included in the HBFP database those proteins with at least two non-nested uniquely mapping peptides of nine amino acids. According to this criterion, 7354 proteins were confidently confirmed. Another 4473 proteins, for which the original studies did not provide peptide sequences but reported more than two unique peptides, were also included.
The overall statistics of the protein entries and references for each body fluid are summarized in Table 1. The current HBFP database contains 11 827 distinct proteins from 17 types of body fluids. Note that urine exceeds all other body fluids in protein count, while blood ranks second. All data are made publicly available in the HBFP and via links at https://bmbl.bmi.osumc.edu/HBFP/.
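The peptide-evidence criterion can be expressed as a small check. The sketch below assumes that the peptides passed in have already been confirmed as uniquely mapping (for example by a uniqueness checker), that "nine amino acids" is read as a minimum length, and that "non-nested" means neither peptide sequence is contained within the other; the function name and example sequences are hypothetical.

```python
from itertools import combinations


def has_two_non_nested(peptides, min_len=9):
    """True if at least two peptides of >= min_len residues are non-nested.

    Assumes every peptide in `peptides` is already uniquely mapping to the
    protein; uniqueness itself is not checked here.
    """
    # Normalize case, drop duplicates and peptides that are too short.
    candidates = sorted({p.upper() for p in peptides if len(p) >= min_len})
    # Two peptides are nested if one sequence is a substring of the other.
    return any(a not in b and b not in a
               for a, b in combinations(candidates, 2))


# Invented sequences for illustration only.
print(has_two_non_nested(["AAAAAAAAAK", "CCCCCCCCCR"]))   # two distinct peptides
print(has_two_non_nested(["PEPTIDESEQK", "EPTIDESEQK"]))  # second is nested in first
```

A protein that passes this check would fall into the sequence-verified group; a protein with no reported peptide sequences cannot be evaluated this way at all.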

Protein abundance
In order to provide users with experimental evidence from the original study, this database also displays relative abundance information from the corresponding literature. General proteomics approaches using MS identify proteins by matching identified peptides against predefined protein sequence databases. The qualitative measures of protein reported in the original reference include the following: (i) peptide information: most of the cited studies provide explicit information about the peptide sequence, the total number of peptides, MS counts or the percent sequence coverage; (ii) differential expression information including fold change (positive value demonstrates up-regulated expression and negative

Confidence score
In the HBFP database, to evaluate the confidence level of each discovered protein in each body fluid, a new statistical measure is calculated based on guideline #9 of HPP Guidelines 2.1 for combined datasets. It is a well-known phenomenon that when N datasets, each with its own FDR, are pooled together, the overall FDR increases with the number of datasets. For example, for plasma, there are 38 papers with plasma protein lists, each with an FDR ≤1%. It is probably a conservative estimate to suppose that the FDR of such a combined result is 1% + 0.5% × (N − 1), where N is the number of datasets (9). This estimate assumes that 50% of the correct identifications overlap and none of the incorrect ones do, so each additional dataset adds 0.5% to the resulting FDR. Consequently, the confidence level of a protein in the combined datasets is reduced. On the other hand, considering the overlap of the true positives, the larger the number of datasets in which a protein is associated with a specific fluid, the more reliable this protein is.
A confidence score C_{i,j} is therefore calculated for each protein in each fluid from A_i, the confidence level of the combined datasets of that fluid, and M_j, the number of studies that report the protein in that fluid. For example, protein O14791 is identified in blood by 19 independent studies, i.e. M_j = 19, and its calculated C_{i,j} score in blood is 0.895. In contrast, protein Q9UJV9 is identified in only one paper for blood, so M_j = 1 and C_{i,j} = A_i = 0.805; that is, Q9UJV9 retains only the confidence level of the combined blood datasets. At the other extreme, protein P01833 is identified in milk by 14 studies, and a total of 14 studies on milk are included in the HBFP, so P01833 retains the original confidence level, i.e. 0.99. The larger the C score, the higher the confidence that a protein is truly present in that fluid. Note that this score can only be compared within the same type of body fluid.
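The worked values in this section can be reproduced numerically. In the sketch below, the pooled confidence A = 1 − [1% + 0.5% × (N − 1)] follows directly from the FDR estimate quoted above, whereas the linear form C = A + 0.5% × (M − 1) is an assumption: it is inferred because it reproduces all three reported examples (0.805, 0.895 and 0.99) and should not be read as the authors' stated formula. The function names are hypothetical.

```python
def combined_confidence(n_datasets, base_fdr=0.01, step=0.005):
    """Pooled confidence A = 1 - [1% + 0.5% x (N - 1)] for N datasets."""
    if n_datasets < 1:
        raise ValueError("n_datasets must be at least 1")
    return 1.0 - (base_fdr + step * (n_datasets - 1))


def confidence_score(n_datasets, n_supporting, step=0.005):
    """Assumed form C = A + 0.5% x (M - 1), with M supporting studies.

    This linear form is an inference from the worked examples in the
    text, not a formula quoted from the source.
    """
    if not 1 <= n_supporting <= n_datasets:
        raise ValueError("n_supporting must lie between 1 and n_datasets")
    return combined_confidence(n_datasets) + step * (n_supporting - 1)


# N = 38 plasma/blood papers; N = 14 milk papers (values from the text).
print(round(confidence_score(38, 1), 3))   # Q9UJV9 in blood
print(round(confidence_score(38, 19), 3))  # O14791 in blood
print(round(confidence_score(14, 14), 3))  # P01833 in milk
```

Under this form, a protein reported by every study of a fluid recovers the single-study confidence of 0.99, and a protein reported once keeps only the pooled confidence, which matches the behavior described in the text.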

Query
All proteins can be easily accessed by searching with a protein ID, gene name, protein sequence (FASTA) or gene sequence (FASTA) (<50 items per query) (Figure 4A and B as an example). A BLAST (182) search is performed locally to find the best match when a sequence in FASTA format is given. For each protein, detailed information is displayed (Figure 4C).
Users can go directly to PubMed or Google Scholar to view the original study through the provided links. Four databases (UniProt, PeptideAtlas, NeXtProt and MassIVE) are cross-linked for additional protein annotation, while the KEGG pathway and GO links focus on functional aspects (Figure 4D).

Download
HBFP allows users to browse the entire protein list in each body fluid, where the proteins are ordered based on descending confidence scores. Users can check and download all entries of the selected body fluid type in one go, as shown in Figure 5.
Demo of comparative analysis using the HBFP database (Table 2).

Venn diagram and GO annotation
To take a closer look at this comparison, we focused on the five body fluids with the highest protein counts: blood, urine, CSF, SF and BALF. An interesting finding is that urine shares large numbers of proteins with the other fluids (Figure 7). A total of 4109, 3212, 2990 and 2950 proteins overlapped between urine and the other four body fluids (blood, CSF, SF and BALF, respectively). There are 965 proteins commonly detected in all five body fluids. Functional analysis using the BiNGO tool (183) in Cytoscape (184) reflects information about the cellular localization, molecular function and biological process of these proteins (Figure 8).
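The pairwise and all-fluid overlaps behind such a Venn diagram reduce to set intersections. The following is a minimal sketch, assuming each fluid's protein list is available as a set of UniProt accessions (for example from the per-fluid downloads); the function names are hypothetical, and the toy accessions and resulting counts are invented and are not the HBFP figures.

```python
from itertools import combinations


def pairwise_overlaps(fluid_proteins):
    """Return {(fluid_a, fluid_b): shared protein count} for every pair."""
    return {
        (a, b): len(fluid_proteins[a] & fluid_proteins[b])
        for a, b in combinations(sorted(fluid_proteins), 2)
    }


def shared_by_all(fluid_proteins):
    """Return the proteins present in every fluid (needs >= 1 fluid)."""
    return set.intersection(*fluid_proteins.values())


# Toy data with made-up accession labels.
fluids = {
    "blood": {"P1", "P2", "P3", "P4"},
    "urine": {"P1", "P2", "P3", "P5"},
    "CSF": {"P1", "P3", "P6"},
}
print(pairwise_overlaps(fluids))
print(sorted(shared_by_all(fluids)))
```

Sorting the fluid names before pairing makes the dictionary keys deterministic, so the same pair is always reported under the same key.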

Conclusions
The new HBFP database developed in this study represents the first of its kind as a comprehensive reference resource for the human body fluid proteome. All data are available through an open-access, user-friendly Web platform. All protein entries were manually curated and can be easily traced back to the original literature. Users can query and download proteins of interest to verify discoveries in their own studies or to conduct in silico analyses of human secretomes. We currently schedule a regular update every 6 months. The future plan is to include computationally identified proteins using statistical and machine learning approaches (185)(186)(187)(188)(189)(190)(191). In the past decade, many computational studies have revealed unique strengths in overcoming challenges in profiling-based proteomics research in terms of discovering new protein bioavailability and functions. Those computationally predicted proteins can serve as a secondary resource for biomarker discovery. In summary, by providing a wealth of information and functional analysis, we believe the HBFP database can be an excellent tool for the research community to explore the human proteome in various body fluids.