PMBC: a manually curated database for prognostic markers of breast cancer

Abstract Breast cancer is notorious for its high mortality and heterogeneity, resulting in different therapeutic responses. Classical biomarkers have been identified and successfully commercially applied to predict the outcome of breast cancer patients. Accumulating biomarkers, including non-coding RNAs, have been reported as prognostic markers for breast cancer with the development of sequencing techniques. However, there are currently no databases dedicated to the curation and characterization of prognostic markers for breast cancer. Therefore, we constructed a curated database for prognostic markers of breast cancer (PMBC). PMBC consists of 1070 markers covering mRNAs, lncRNAs, miRNAs and circRNAs. These markers are enriched in various cancer- and epithelial-related functions including mitogen-activated protein kinases signaling. We mapped the prognostic markers into the ceRNA network from starBase. The lncRNA NEAT1 competes with 11 RNAs, including lncRNAs and mRNAs. The majority of the ceRNAs in ABAT belong to pseudogenes. The topology analysis of the ceRNA network reveals that known prognostic RNAs have higher closeness than random. Among all the biomarkers, prognostic lncRNAs have a higher degree, while prognostic mRNAs have significantly higher closeness than random RNAs. These results indicate that the lncRNAs play important roles in maintaining the interactions between lncRNAs and their ceRNAs, which might be used as a characteristic to prioritize prognostic lncRNAs based on the ceRNA network. PMBC renders a user-friendly interface and provides detailed information about individual prognostic markers, which will facilitate the precision treatment of breast cancer. PMBC is available at the following URL: http://www.pmbreastcancer.com/.


Introduction
Breast cancer is the most commonly diagnosed cancer and has the highest mortality in women, according to Global Cancer Statistics 2020 (1).As a highly heterogeneous disease, subtype-specific therapeutic strategies were utilized to improve the survival of patients.However, de novo resistance to therapies or acquired resistance always led to unfavorable outcomes.
Numerous biomarkers have been identified and validated to predict the survival of patients since the wide application of microarrays and the next generation of sequencing.A 70-gene signature has been proposed for prognosis prediction and widely applied to select therapeutic strategies (2).Oncotype DX is a prognostic signature for hormone-receptor-positive breast cancer based on 21 genes (3).EndoPredict combines the expression of proliferative and etrogen reptor signalingassociated genes for the distant recurrence of breast cancer (4).
The PAM50 signature was developed to identify the intrinsic subtype and recurrence risk of breast cancer (5).

Data acquisition
We searched in PubMed (http://www.ncbi.nlm.nih.gov/pubmed) for publications with prognostic markers related to breast cancer.Specifically, we used the following search terms: '((prognostic

Network topological analysis
The R package igraph was used to calculate the topological features of RNAs in the ceRNA network (R 4.1.2).The degree of an RNA is the number of its neighboring RNAs in the ceRNA network.The closeness centrality of one RNA i is calculated as follows: , where nodes j is a different node from node i, d ij denotes the distance between node i and j, n is the total number of nodes in the ceRNA network.The betweenness of one RNA i is calculated as following: , where RNAs s and t are nodes different from RNA i in the ceRNA network, d st denotes the shortest paths from RNA s to t,  st denotes the number of shortest paths from RNA s to t which RNA i lies on.For two RNAs s and t, the ratio is the percentage of shortest path that RNA i lies on.The sum of all RNA pairs is the betweenness of RNA i.For RNA permutations, equal numbers of prognostic RNAs were randomly selected from all the RNAs in the human genome with replacement for 100 000 times.Cytoscape v 3.8.0 was used to visualize the ceRNA network.

Database construction
The database is freely available at http://www.pmbreast cancer.com/.The database was developed using Tomcat (v 9.0.0) as the Web server and using Asynchronous Javascript And XML (AJAX) technology to realize the front-end and back-end interactions.The MySQL (v 5.5) data server was used to store and manage the datasets.The creation of result tables was performed using jQuery (v3.6.0)plugin software.The website is supported on popular web browsers, such as Microsoft Edge, Google Chrome and Firefox.

Technical details
We used HyperText Markup Language for the webpage structure, CSS for graphic design and JavaScript with AJAX for interactive elements.We utilized AJAX technology for the fore-end to execute front-end development solutions, handling asynchronous requests through the $.ajax() method in the jQuery framework.For the back-end, we used Servlet to implement the back-end development solutions, selecting Tomcat as the web server to route the front-end requests to the corresponding Servlet based on the URL paths defined in the $.ajax() method.We connected the web-based platform with the database of breast cancer prognostic markers with MySQL for querying or inserting data and generating JavaScript Object Notation data in response to return it to system users.

Data source
We manually curated 714 publications and obtained 1070 prognostic markers appearing in 1357 records for breast cancer.As shown in

Basic statistics of the curated prognostic markers
Among the 1070 prognostic markers of breast cancer, we sorted them according to their frequency.As shown in Figure 1A, HIF1A is the most frequently reported prognostic marker, and AR, EGFR and MIK67 are the second most reported markers.HOX transcript antisense RNA is the most frequently reported prognostic lncRNA, while miR-125b, miR-205, miR-21 and miR-210 are the most frequently reported prognostic miRNAs.Besides, we rank the journals according to the number of reported prognostic markers (Figure 1B).The journal 'Cancers (Basal)' is at the top of the list, followed by the journals 'Breast Cancer Res' and 'Cancer Res'.

Database design
PMBC provides a user-friendly interface to allow users to easily query information for prognostic markers of breast cancer.A screenshot of the database for prognostic markers of breast cancer is shown in Figure 2. Specifically, the webpage consists of seven parts: 'Home', 'Browse', 'Search' 'Submit', 'Help' and 'Contact us'.In the 'Browse' page, users can browse prognostic markers by RNA type, including miRNA, lncRNA, mRNA, circRNA and pseudogene.PMBC also provides browsing with initials of prognostic markers.By clicking a specific initial, all prognostic markers with this initial are returned and shown in a table.In the 'Search' page, PMBC allows users to search by symbols of prognostic markers.The database enables fuzzy searching, allowing users to return the most possible matching results.The corresponding results will be returned about the searched prognostic marker, including the publication ID in the PubMed database of National Center for Biotechnology Information, the RNA type of the marker, the favorable or poor outcome for patients with high expression of the marker and the detailed description in the literature.PMBC designed a 'Submit' page that enables researchers to submit up-to-date prognostic markers of breast cancer.Once approved by the review committee, the background database will be updated including the submitted record.Moreover, a step-by-step tutorial is also rendered to facilitate users quickly know how to use the database in the 'Help' page.The users can also contact us with the information provided in the 'Contact us' page.

Function enrichment
We performed functional enrichment to explore the biological processes and pathways for the known prognostic mark-ers.The results showed that these markers were enriched in 'regulation of epithelial cell proliferation', 'positive regulation of mitogen-activated protein kinases (MAPK) cascade', 'gland development', and 'epithelial cell proliferation' (Figure 3A).Among the significant pathways, the pathways enriched by the greatest number of prognostic markers are 'Proteoglycans in cancer', 'MicroRNAs in cancer', 'PI3K-Akt signaling pathway' and 'MAPK signaling pathway' (Figure 3B).
Known prognostic markers play pivotal roles in the ceRNA network The mRNAs and lncRNAs compete to bind with miR-NAs and function as ceRNAs.Therefore, we systematically Figure 3. Functional enrichment of prognostic markers.We performed the functional enrichment of the known prognostic RNAs in biological processes of Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway.The significant terms were ranked according to the number of prognostic markers.The top 10 significant terms were shown for (A) GO and (B) KEGG.
characterize prognostic markers of breast cancer based on the ceRNA network.These prognostic markers were mapped to the downloaded ceRNA network, which comprises mRNA, lncRNA and pseudogene.As a result, 631 markers have interactions in the ceRNA network, including 602 mRNAs, 28 lncRNAs and one pseudogene.Then, we characterize the topological features of these prognostic markers from the perspectives of degree, closeness and betweenness.The RNAs (nodes) in the ceRNA network were ranked according to these three topological features, respectively.By comparing the topological value of the prognostic RNAs with that of the random RNAs, we found that known prognostic RNAs have higher normalized closeness than random (P < 1.00E − 05, Table 2, Figure 4A), which suggests that they have a shorter distance to other nodes in the ceRNA network.Besides, we divided the prognostic markers into three RNA types, lncRNA, mRNA and pseudogene.
Compared with the random RNAs, prognostic lncRNAs have both higher raw and normalized degrees, (P < 1.00E − 05, Table 2, Figure 4B and C).For prognostic mRNAs, they have significantly higher normalized closeness (P < 1.00E − 05, Table 2, Figure 4D).From these results, we conclude that the lncRNAs play important roles in maintaining the interactions between lncRNAs and their ceRNAs, which might be used as a characteristic to prioritize prognostic lncRNAs based on its ceRNA network.
We further extract the ceRNA subnetwork for the 631 prognostic markers, and each pair of markers competitively binds to same miRNAs.The network is shown in Figure 5.The lncRNA NEAT1 competes with 11 RNAs, including lncRNAs and mRNAs, to bind with miRNAs.Majority of the ceRNAs of ABAT belong to pseudogenes, while nearly all the ceRNAs of the pseudogene PSPHP1 are pseudogenes.The topological features with P < 0.05 are marked in Bold.

Future extensions
We will review and select newly reported prognostic indicators for breast cancer every 6 months and update the database with user-submitted material monthly.In the current ceRNA network, there are no interactions for circRNA, which fail to characterize them in the ceRNA network.It will be our future direction to reconstruct the ceRNA network to include more RNA types, including but not limited to circRNA.The extended number of prognostic markers and incorporation of more comprehensive ceRNA interactions will improve the utility and coverage of this database.We will have a global and precise understanding of prognostic markers of breast cancer.

Discussion
We have developed a user-friendly database, PMBC.PMBC provides curated prognostic markers for breast cancer.It is different from survival analysis web tools in that it (i) is composed of prognostic markers that are manually curated from publications, (ii) contains various types of RNA markers, including mRNA, miRNA, lncRNA, circRNA and pseudogene, which may not simultaneously be detected in one dataset, (iii) enables the characterization of prognostic markers from the perspective of the ceRNA network and (iv) facilitates the prioritization of prognostic markers based on the network topology of markers in PMBC.To ensure quick and easy searching ability, an intuitive querying interface was implemented in PMBC.

Conclusions
PMBC hosts comprehensive prognostic markers spanning from mRNAs, lncRNAs, miRNAs, to circRNAs.Besides, it provides convenient and user-friendly interfaces to explore the known prognostic markers.We have generated PMBC as a cross-validation tool for researchers to identify potential prognostic RNA biomarkers to follow up for further research.PMBC allows for rapid retrieval and organization of information about cancer prognostic markers, facilitating researchers in comprehending pertinent data and enhancing therapy efficacy and patient well-being.PMBC will help accelerate translational research from bench to bedside.

Figure 1 .
Figure 1.Basic statistics of the curated prognostic markers.(A) Most frequently reported prognostic markers.(B) The journals with the top number of prognostic markers.

Figure 4 .
Figure 4. Significant topological features of prognostic markers.We compared the topological value of the known prognostic RNAs with that of the random RNAs.(A) The normalized closeness was compared with that of random.The raw (B) and normalized (C) degree of prognostic was compared with that of random.(D) The normalized closeness of prognostic mRNAs was compared with that of random.

Figure 5 .
Figure 5. CeRNA network of prognostic markers.The blue represent mRNAs, the orange nodes represent lncRNAs and the dark gold nodes represent pseudogenes.The node size is proportional to its degree in the ceRNA network.

Table 1 .
Number of RNA records and unique RNA markers

Table 2 .
Topological features of prognostic markers