Knowledge Graph for Breast Cancer Prevention and Treatment: Literature-Based Data Analysis Study

Background: The incidence of breast cancer has remained high and has continued to rise since the beginning of the 21st century. Consequently, there has been a significant increase in research efforts focused on breast cancer prevention and treatment. Despite the extensive body of literature available on this subject, systematic integration is lacking. To address this issue, knowledge graphs have emerged as a valuable tool. By harnessing their powerful knowledge integration capabilities, knowledge graphs offer a comprehensive and structured approach to understanding breast cancer prevention and treatment.
Objective: We aim to integrate literature data on breast cancer treatment and prevention, build a knowledge graph, and provide support for clinical decision-making.
Methods: We used Medical Subject Headings terms to search for clinical trial literature on breast cancer prevention and treatment published on PubMed between 2018 and 2022. We downloaded triplet data from the Semantic MEDLINE Database (SemMedDB) and matched them with the retrieved literature to obtain triplet data for the target articles. We visualized the triplet information using NetworkX for knowledge discovery.
Results: Within the literature published in the past 5 years, malignant neoplasms appeared most frequently (587/1387, 42.3%). Pharmacotherapy (267/1387, 19.3%) was the primary treatment method, with trastuzumab (209/1805, 11.6%) being the most commonly used therapeutic drug. Through the analysis of the knowledge graph, we discovered a complex network of relationships between treatment methods, therapeutic drugs, and preventive measures for different types of breast cancer.
Conclusions: This study constructed a knowledge graph for breast cancer prevention and treatment, which enabled the integration and knowledge discovery of relevant literature in the past 5 years. Researchers can gain insights into treatment methods, drugs, preventive knowledge regarding adverse reactions to treatment, and the associations between different knowledge domains from the graph.


Introduction
Breast cancer is the most common malignant tumor in women worldwide, with a reported death toll exceeding 600,000 in 2018 alone [1]. Breast cancer has emerged as the most prevalent cancer and a primary cause of mortality among women. The global incidence of new cases of female breast cancer rose sharply from 1.05 million in 2000 to 2.09 million in 2018 [2]. In 2020, global cancer burden data revealed that new breast cancer cases reached 2.26 million, constituting 11.7% of all newly diagnosed cancer cases worldwide. The newly reported mortality cases numbered 0.68 million, representing 6.9% of global newly reported deaths [3]. Factors such as old age, young age at menarche, family history of breast cancer, smoking, and drinking alcohol increase the risk of breast cancer [4][5][6]. Conversely, regular physical exercise; breastfeeding; regular work and rest; and intake of fruits, vegetables, whole grains, and dietary fiber can reduce the risk of breast cancer [7]. Various treatment methods are used for patients with breast cancer, including surgery, radiation therapy, endocrine therapy, chemotherapy, and targeted therapy. So far, most countries have primarily focused on population education for breast cancer prevention, including encouraging increased physical activity, controlling BMI, and limiting alcohol intake [8]. Despite the growing volume of research, a large amount of literature on breast cancer prevention and treatment has not been systematically integrated. Knowledge graph technology allows disparate literature to be connected and integrated, resulting in a more comprehensive and cohesive knowledge framework.
The Knowledge Graph is a knowledge repository proposed by Google in 2012 to enhance the functionality of search engines. It describes concepts and their relationships in the real world using triplets in the form of entity-relation-entity [9]. Knowledge graphs can integrate information from diverse sources and domains, including text, databases, and web pages, and intricately interlink them. These integrations serve to mitigate information silos, fostering the establishment of a more comprehensive knowledge framework. Knowledge graphs have been widely used in various fields, such as medicine, network security, journalism, finance, and education [10]. Knowledge graphs in the biomedical domain have applications in studies related to disease associations [11], genomics [12], drug interactions [13], and support for physicians in formulating individualized treatment regimens [14]. At present, there are well-established knowledge graphs, including DisGeNET [15], which integrates information on the associations between genes and diseases; DrugBank [16], a comprehensive bioinformatics and cheminformatics knowledge base; and ClinVar [17], a compilation of genetic variation information from diverse laboratories worldwide. One study extracted breast cancer-related features from Chinese breast cancer mammography reports and built a knowledge graph for diagnosing breast cancer by combining diagnosis and treatment guidelines and insights from clinical experts [18]. Another study integrated triples from clinical guidelines, medical encyclopedias, and electronic medical records to build a breast cancer knowledge graph [19]. Although a small number of scholars have constructed knowledge graphs for breast cancer, their varied emphases and diverse data sources limit their applicability. A knowledge graph specifically focused on the prevention and treatment of breast cancer has not yet been constructed. Therefore, this study primarily collects information related to the prevention and treatment of breast cancer to construct a knowledge graph.
In the biomedical field, there are already mature tools (eg, SemRep) for extracting knowledge from medical texts. SemRep is a natural language processing program based on the Unified Medical Language System (UMLS), which performs operations such as text tokenization, syntactic analysis, part-of-speech disambiguation, phrase mapping, semantic predicate normalization, and syntactic constraints [20]. It extracts entities and relationships from biomedical texts and outputs triplets stored in the Semantic MEDLINE Database (SemMedDB) [21]. SemMedDB currently encompasses details on approximately 96.3 million predications derived from all PubMed citations (around 29.1 million citations) and serves as the foundation for the Semantic MEDLINE application [22]. We downloaded the entity and relationship data provided by SemMedDB. NetworkX is an open-source library for Python, primarily designed for creating, analyzing, and visualizing complex network structures. NetworkX plays a significant role in knowledge visualization, helping users intuitively present and comprehend intricate knowledge graphs or network data.

Ethics Approval
This study was approved by the Board of Medical Ethics Committee of Shenzhen Maternal and Child Health Hospital (SFYLS[2022]003).

Data Source
We conducted a search on PubMed using the Medical Subject Headings terms "breast cancer," "prevention," and "treatment," covering the period from January 1, 2018, to December 31, 2022, and restricted the study type to clinical trials. A total of 3589 articles were retrieved. We obtained the entity and relationship data from SemMedDB.

Data Processing and Construction of Knowledge Graph
We matched the PMIDs of the retrieved articles with the database and extracted the corresponding triplet information. We initially obtained 33,060 subject-predicate-object (SPO) triplets.
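The PMID matching step can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the `Triplet` record, the `match_by_pmid` helper, and the toy rows are hypothetical, and a real SemMedDB export would carry more fields than the four shown here.

```python
from typing import Iterable, List, NamedTuple


class Triplet(NamedTuple):
    """Hypothetical, simplified view of one SemMedDB predication."""
    pmid: str
    subject: str
    predicate: str
    obj: str


def match_by_pmid(predications: Iterable[Triplet],
                  retrieved_pmids: Iterable[str]) -> List[Triplet]:
    """Keep only predications whose PMID is in the retrieved article set."""
    wanted = set(retrieved_pmids)  # set membership keeps the scan linear
    return [t for t in predications if t.pmid in wanted]


# Toy rows for illustration only (not data from the study).
predications = [
    Triplet("100", "Trastuzumab", "TREATS", "Malignant neoplasm of breast"),
    Triplet("200", "Aspirin", "TREATS", "Headache"),
    Triplet("300", "Scalp cooling", "PREVENTS", "Alopecia"),
]
matched = match_by_pmid(predications, ["100", "300"])
```

Holding the retrieved PMIDs in a set means the full predication table only needs to be scanned once, which matters when the table contains tens of millions of rows.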
Next, we made improvements according to the SPO cleaning principles proposed by Fiszman et al [9] (ie, relevance, connectivity, novelty, and significance). We combined them with expert manual screening to ensure that the selected SPO triplets have a higher relevance. In the improved process, we did not predefine semantic patterns. Instead, we used a series of cleaning operations to select core SPO triplets and connected SPO triplets, eliminating SPO triplets that lacked specific information and those that appeared only once. The specific process is as follows:
1. In the same article, there may be repeated occurrences of identical SPO triplets. To maintain equal contribution from each article, we counted the repeated SPO triplets once within the same article.
2. To ensure statistical reliability, we calculated the occurrence frequency of each SPO triplet across different articles. SPO triplets with low occurrence frequencies may lack statistical significance. Therefore, we retained only SPO triplets with frequencies greater than or equal to 2.
3. Based on expert domain knowledge, we manually screened the retained SPO triplets with frequencies greater than or equal to 2 to identify those of research value.
Finally, we obtained 25,449 SPO triplets. We imported the filtered SPO triplet information into NetworkX for visual analysis to explore knowledge and information related to breast cancer prevention and treatment.
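The first two cleaning steps (counting a triplet once per article, then keeping triplets that occur in at least 2 articles) can be sketched as below. This is an illustrative sketch only: the function name `filter_by_article_frequency`, the tuple layout, and the toy rows are assumptions, and the expert manual screening in step 3 is not modeled.

```python
from collections import Counter


def filter_by_article_frequency(rows, min_articles=2):
    """rows: iterable of (pmid, subject, predicate, object) tuples.

    Step 1: collapse identical triplets within the same article.
    Step 2: count in how many articles each triplet occurs and keep
            those that reach the threshold.
    """
    unique_per_article = set(rows)  # same (pmid, s, p, o) counted once
    article_freq = Counter((s, p, o) for _, s, p, o in unique_per_article)
    return {spo: n for spo, n in article_freq.items() if n >= min_articles}


# Toy rows for illustration only (not data from the study).
rows = [
    ("1", "Trastuzumab", "TREATS", "Breast cancer"),
    ("1", "Trastuzumab", "TREATS", "Breast cancer"),  # repeat in same article
    ("2", "Trastuzumab", "TREATS", "Breast cancer"),
    ("2", "Diet", "PREVENTS", "Stomatitis"),
    ("3", "Massage", "TREATS", "Neuropathy"),
    ("3", "Massage", "TREATS", "Neuropathy"),         # repeat in same article
]
kept = filter_by_article_frequency(rows)
```

Note that the "Massage" triplet is dropped even though it appears twice, because both occurrences come from the same article and therefore count as one.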

Summary of Included Literature
A total of 3589 articles were published in 618 different journals. Among them, 191 articles were published in a single journal, while 293 journals each published only 1 article. The journals were ranked based on the number of publications, and the top 100 journals accounted for 2631 articles, which is 73.30% of the total.

Semantic Relationships and Semantic Patterns
We mainly summarize semantic associations into 3 types: treatment and prevention, influencing or associated factors, and related diseases (Table S1 in Multimedia Appendix 1). Regarding treatment and prevention, the relationships include TREATS, ADMINISTERED_TO, USES, and PREVENTS, representing treatment drugs, surgeries, and preventive measures for breast cancer. Regarding influencing or associated factors, the relationships include ASSOCIATED_WITH, AFFECTS, and CAUSES, which represent diseases' impact and etiological factors. Regarding related diseases, the relationship COEXISTS_WITH represents the coexistence between different diseases. In the semantic patterns involving treatment (TREATS), the topp-TREATS-neop and topp-TREATS-podg patterns each appeared over 1000 times.
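A semantic pattern such as topp-TREATS-neop combines the subject's semantic type, the predicate, and the object's semantic type. Counting such patterns can be sketched as follows; the helper name `count_semantic_patterns` and the toy input are hypothetical, and the abbreviations are assumed to be UMLS semantic type codes (for example, topp for therapeutic or preventive procedure and neop for neoplastic process).

```python
from collections import Counter


def count_semantic_patterns(typed_triplets):
    """typed_triplets: iterable of (subject_semtype, predicate, object_semtype).

    Returns a Counter keyed by 'semtype-PREDICATE-semtype' strings.
    """
    return Counter(
        f"{s_type}-{predicate}-{o_type}"
        for s_type, predicate, o_type in typed_triplets
    )


# Toy input for illustration only (not data from the study).
sample = [
    ("topp", "TREATS", "neop"),
    ("topp", "TREATS", "podg"),
    ("topp", "TREATS", "neop"),
    ("phsu", "TREATS", "neop"),
]
patterns = count_semantic_patterns(sample)
top_pattern = patterns.most_common(1)[0]
```

Sorting the resulting counter with `most_common` gives a ranked list of patterns, which is one way to see which kinds of relationships dominate a predicate such as TREATS.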

Breast Cancer Knowledge Graph
We visualized the SPO triplets and displayed 3 subgroups: breast cancer treatment methods, therapeutic drugs, and relevant preventive measures. Figure 1 shows the relationship between different subtypes and stages of breast cancer and treatment methods. Among the subtypes of breast cancer, malignant neoplasm of the breast has the highest frequency, and pharmacotherapy has the highest frequency among the treatment modalities. Each subtype corresponds to multiple treatment modalities; likewise, a single treatment modality corresponds to multiple breast cancer subtypes. Figure 2 shows the relationship between different subtypes and stages of breast cancer and drugs. Among the therapeutic drugs for breast cancer, trastuzumab has the highest frequency and corresponds to the most types of breast cancer. Capecitabine, paclitaxel, aromatase inhibitors, and immunologic adjuvants also have relatively high frequencies. In comparison, immunologic adjuvants have the fewest connections with different types of breast cancer. Figure 3 shows the relationship between breast cancer treatment and adverse reactions. Pharmacotherapy is associated with neuropathy, onycholysis, neutropenia, heart failure, alopecia, febrile neutropenia, anemia, stomatitis, leukopenia, thrombocytopenia, premature menopause, and gastrointestinal dysfunction. Additionally, multiple nodes are connected, forming multiple pathways, such as pharmacotherapy-febrile neutropenia-adjuvant chemotherapy and pharmacotherapy-leukopenia-breast cancer therapeutic procedure-osteoporosis.
Figure 4 shows the relationship between adverse events after breast cancer treatment and preventive measures. Peripheral neuropathy is associated with cryotherapy, low-level laser therapy, compression procedure, acupuncture procedure, pharmacotherapy, and massage. Lymphedema is associated with resistance education, axillary lymph node dissection, physical therapy, excision of axillary lymph nodes group, and drainage of lymphatics. Early radiation dermatitis is associated with topical administration and the bleomycin, cisplatin, or methotrexate protocol. In addition, some adverse reactions have relatively few corresponding measures, such as stomatitis-diet and alopecia-scalp cooling. We performed a relationship visualization to gain a better understanding of the association between types of breast cancer, treatments, drugs, and genes. Figure 5 intuitively reflects the high frequency of malignant neoplasm of the breast, pharmacotherapy, and trastuzumab. In addition, breast malignant tumors are associated with multiple genes, such as the phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit alpha (PIK3CA) gene, platelet-derived growth factor receptor beta (PDGFRB) gene, phosphatase and tensin homolog (PTEN) gene, and erb-B2 receptor tyrosine kinase 2 (ERBB2) gene.
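Pathways of the kind described above can be explored programmatically once the triplets are loaded into NetworkX. The sketch below is illustrative only: the node pairs come from the associations named in the text, but the predicates and frequency weights are hypothetical placeholders, and the use of an undirected graph is an assumption made to keep path finding simple.

```python
import networkx as nx

# (subject, predicate, object, frequency); predicates and weights are
# hypothetical placeholders, not values reported in the study.
weighted_triplets = [
    ("Pharmacotherapy", "CAUSES", "Febrile neutropenia", 3),
    ("Adjuvant chemotherapy", "CAUSES", "Febrile neutropenia", 2),
    ("Pharmacotherapy", "CAUSES", "Alopecia", 2),
    ("Scalp cooling", "PREVENTS", "Alopecia", 2),
]

G = nx.Graph()
for subject, predicate, obj, freq in weighted_triplets:
    # Store the predicate and frequency as edge attributes.
    G.add_edge(subject, obj, predicate=predicate, weight=freq)

# Weighted degree: sum of edge frequencies touching each node.
weighted_degree = dict(G.degree(weight="weight"))

# Shortest chain of associations linking a treatment to a preventive measure.
path = nx.shortest_path(G, "Pharmacotherapy", "Scalp cooling")
```

In this toy graph the path runs from pharmacotherapy through alopecia to scalp cooling, mirroring how a reader might trace a treatment to an adverse reaction and then to a mitigating measure. The graph could then be rendered with `nx.draw`, which requires Matplotlib.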

Principal Findings
The knowledge graphs constructed in this study help researchers understand the research hot spots in breast cancer over the past 5 years. The complex network involving treatment methods, drugs, adverse reactions, preventive measures, and genes in breast cancer can assist clinicians in making decisions that comprehensively consider multiple aspects, ultimately aiding in decisions that are the most beneficial to patients. Additionally, the knowledge graph allows for personalized considerations based on specific genes for individual patients.
This study found that from 2018 to 2022, breast malignancies appeared most frequently in the literature and were the primary concern for researchers. Research interest in triple-negative breast neoplasms is higher than in other subtypes. This phenomenon may be due to the higher risk of recurrence and poor prognosis in patients with early-stage triple-negative breast neoplasms [10], making it a subject of greater concern to clinicians and researchers. Among treatment modalities, pharmacotherapy receives the highest attention. Pharmacotherapy for breast cancer primarily involves chemotherapy, endocrine therapy, and targeted therapy [27]. Compared to traditional surgery and radiotherapy, pharmacotherapy can more precisely intervene in the growth and division of cancer cells by targeting specific molecules or cellular structures, which reduces damage to normal cells and allows for the formulation of personalized treatment plans based on the patient's genotype and molecular characteristics [28]. Medications circulating through the bloodstream can also act on cancer cells throughout the body, preventing cancer cell metastasis. These advantages of pharmacotherapy may be related to the heightened emphasis on pharmacotherapy over the past 5 years. Trastuzumab receives the highest attention in breast cancer pharmacotherapy; it is a specific cancer-targeting medication used in the treatment of cancers characterized by elevated levels of HER2 protein [29].
Pharmacotherapy is associated with various adverse reactions, including neutropenia, neuropathy, onycholysis, heart failure, alopecia, and febrile neutropenia. Among these adverse reactions, peripheral neuropathy and lymphedema have the most corresponding preventive and treatment measures, with lymphedema being a common complication after surgery [30]. However, there is limited research on how to prevent and treat the potential adverse reactions of pharmacotherapy, and further studies are needed. Various adverse effects of breast cancer treatment may reduce patients' adherence to treatment. Therefore, when clinicians choose different treatments and drugs, they should pay close attention to their potential adverse reactions and how to prevent or mitigate them.
Among existing knowledge graphs related to breast cancer, one study from China constructed a knowledge graph using electronic medical records, clinical guidelines, and expert opinions, primarily focusing on breast cancer diagnosis [18]. Another study by Chinese scholars also used data from various sources, including clinical guidelines, medical encyclopedias, and electronic medical records, to construct a knowledge graph primarily applied to medical knowledge question-answering and medical record retrieval [19]. These studies used data from multiple sources, including structured, unstructured, and semistructured data. With such heterogeneous sources, data extraction and accuracy are challenging. Therefore, they used neural network models for training and calculated a series of metrics to ensure data accuracy. For instance, they used BERT + Bi-LSTM + CRF on textual data to achieve named entity recognition. In this study, SemMedDB was used as the data source; the database was constructed by extracting semantic information from PubMed using SemRep, which has demonstrated good performance on biomedical text [21].
In summary, the knowledge graph constructed in this study for breast cancer treatment and prevention encompasses information on different stages, subtypes of breast cancer, treatment modalities, medications, adverse reactions, and preventive measures. This knowledge forms a complex network, providing clinical practitioners with a comprehensive and referenced knowledge base. We recommend that clinical practitioners apply our research findings in several aspects. First, clinicians can gain insights into the current state of breast cancer treatment and prevention research through our study. Additionally, there is a relative lack of preventive measures and strategies for mitigating postoperative and postmedication adverse reactions compared to breast cancer treatment, and more efforts are needed in these areas. Furthermore, our research can assist clinicians in making comprehensive decisions. For instance, when selecting a treatment approach for patients, the knowledge graph facilitates linking to available medications, associated adverse reactions, and measures to mitigate or prevent adverse effects.
Our research still has several limitations. First, SemRep, as a natural language processing program based on the UMLS, still exhibits shortcomings. Despite the extensive coverage and scale of the UMLS Metathesaurus, SemRep has a relatively limited ability to recognize entities. There are still areas for improvement in processing natural language texts [20]. Second, clinical researchers often prefer causal relationships rather than pure correlations; however, our study can only reveal the connections between pieces of information and cannot determine the magnitude and direction of their effects. Third, with the release of new literature, the knowledge graph also needs to be updated promptly, increasing the burden on researchers. Future improvements should focus on automating the mining of literature data to ensure timely updates to the knowledge graph for breast cancer prevention and treatment, thereby alleviating the burden on researchers.

Conclusions
This study successfully constructed a knowledge graph for breast cancer prevention and treatment by integrating relevant literature from the past 5 years and conducting knowledge discovery. Through this knowledge graph, researchers can learn about breast cancer treatment methods, medications, and

Figure 1. Relationship between different subtypes and stages of breast cancer and treatment methods. HER2: human epidermal growth factor receptor 2.

Figure 2. Relationship between different subtypes and stages of breast cancer and drugs. HER2: human epidermal growth factor receptor 2.

Figure 3. Relationship between breast cancer treatment and adverse reactions.

Figure 4. Relationship between adverse reactions after breast cancer treatment and preventive measures.

Figure 5. Relationship between types of breast cancer, treatments, drugs, and genes.

Table 1. Summary of breast cancer subtypes and stages, treatment methods, and treatment drugs. The top 30 subtypes, treatment methods, and treatment drugs with higher frequencies in all data are presented for each group.