MarkerGenie: an NLP-enabled text-mining system for biomedical entity relation extraction

Abstract Motivation Natural language processing (NLP) tasks aim to convert unstructured text data (e.g. articles or dialogues) into structured information. In recent years, we have witnessed fundamental advances in NLP techniques, which have been widely used in applications such as financial text mining, news recommendation and machine translation. However, applying NLP in the biomedical space remains challenging due to the lack of labeled data and the ambiguity and inconsistency of biological terminology. In biomedical marker discovery studies, tools that rely on NLP models to automatically and accurately extract relations between biomedical entities are valuable: they can survey all available literature more thoroughly, and hence less biasedly, than manual curation, and the speed of a machine reader helps quickly orient research and development. Results To address these needs, we built automatic training-data labeling, rule-based biological terminology cleaning and a more accurate NLP model for binary associative and multi-relation prediction into the MarkerGenie program. We demonstrate the effectiveness of the proposed methods in identifying relations between biomedical entities on various benchmark datasets and case studies. Availability and implementation MarkerGenie is available at https://www.genegeniedx.com/markergenie/. Data for model training and evaluation, term lists of biomedical entities, details of the case studies and all trained models are provided at https://drive.google.com/drive/folders/14RypiIfIr3W_K-mNIAx9BNtObHSZoAyn?usp=sharing. Supplementary information Supplementary data are available at Bioinformatics Advances online.


S1.2 Front and back ends of MarkerGenie
The front-end interface of MarkerGenie is based on Vue, and the back-end framework is based on Django. The homepage of MarkerGenie provides an input text box where users can enter the disease they want to query. Because user input varies and some overly general disease terms were excluded from our lists, MarkerGenie performs fuzzy matching on the input: if a term in the term lists contains the entered keyword, rather than matching it exactly, MarkerGenie treats that term as a candidate for what the user is searching for. In this way, users can even enter a body part to retrieve related diseases.
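The substring-based fuzzy matching described above can be sketched as follows; the function name, data layout and example terms are illustrative, not MarkerGenie's actual implementation.

```python
def fuzzy_match(query, term_lists):
    """Return terms whose name or any synonym contains the query string.

    A term matches when it *contains* the user's keyword rather than
    equaling it exactly, so entering a body part (e.g. "lung")
    retrieves related diseases.
    """
    query = query.strip().lower()
    hits = []
    for name, synonyms in term_lists.items():
        if any(query in s.lower() for s in [name, *synonyms]):
            hits.append(name)
    return hits

diseases = {
    "lung cancer": ["lung carcinoma", "pulmonary cancer"],
    "colorectal cancer": ["bowel cancer"],
}
fuzzy_match("lung", diseases)  # → ['lung cancer']
```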
Suppose a user enters a disease and wants to find relevant biomarkers from the text of publications. After the user selects the disease of interest and the biomarker type (default: all biomarkers), MarkerGenie outputs the relevant biomarkers together with the supporting sentences. The process is divided into five steps: 1. Check whether the disease and biomarker type appear in the search history table. If so, go to step 5.
2. Query the term lists for all synonyms of the disease, and use Elasticsearch to filter the corpus.
3. Apply exact string matching to annotate sentences, selecting those that contain both the disease and biomarkers.
4. Use the trained models to determine whether these sentences discuss a relation between the disease and the biomarkers. Keep the associated sentences and write them into the search history table.
5. Retrieve records from the search history table for the selected disease and biomarker types, sorted in descending order by the number of sentences discussing their relations.
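The five steps above can be sketched as a single query function; every name here (the arguments, the `model` stub standing in for the trained classifier, the `history` dict standing in for the search history table) is illustrative rather than MarkerGenie's actual API.

```python
from collections import Counter

def query_biomarkers(disease, markers, corpus, synonyms, model, history):
    """Sketch of the five-step search pipeline (all names illustrative)."""
    if disease not in history:                                # step 1: cache check
        terms = synonyms.get(disease, [disease])              # step 2: synonyms
        filtered = [s for s in corpus
                    if any(t in s for t in terms)]            #         filter corpus
        annotated = [(s, m) for s in filtered
                     for m in markers if m in s]              # step 3: exact match
        history[disease] = [(s, m) for s, m in annotated
                            if model(s)]                      # step 4: classify, cache
    counts = Counter(m for _, m in history[disease])          # step 5: rank markers by
    return counts.most_common()                               #         sentence count
```

A repeated query with the same disease skips steps 2–4 entirely, which is the point of the search history table.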
Finally, when a user selects a specific sentence, MarkerGenie uses Elasticsearch to retrieve the article containing that sentence by its article ID, so the user can see more information, including the article's title, abstract, publication date, and links to PubMed and PubMed Central.
Some users may instead want the biomarkers related to the disease of interest as summarized in tables. They only need to click the "table's ranking result" button: MarkerGenie returns the biomarkers related to the entered disease that appear in tables, sorted by table counts. Users can view a table's content and the source article by clicking the "Detail" button. Some front-end pages of MarkerGenie are shown in Figure S1.

S1.3 Integrating users' feedback
To improve the user experience, MarkerGenie also collects user feedback. Users can indicate whether our results are correct; this feedback is used to fine-tune the model, improving prediction accuracy and providing results more in line with user expectations.

S1.4 Time performance
Time performance is another essential factor affecting usability. The time spent by MarkerGenie divides into three parts: 1. Filtering the corpus from all publications based on the queried bioentities' synonyms.
2. Selecting sentences through exact string matching.
3. Applying the trained models to predict whether the sentences are positive or not.
The time performance of each part is shown in Table S1. In the actual deployment, MarkerGenie speeds up by using a more lightweight model (Bi-GRU; Cho et al., 2014) and processing the last two parts in parallel across 20 processes. We ran a series of speed tests to measure the time spent per search. A typical search with 2,000 relevant abstracts and 200 full texts finished in less than 30 s. A complete search covering about 100,000 abstracts and 100,000 full texts took 30 minutes.
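The parallel processing of the last two parts can be sketched with Python's standard `multiprocessing` pool; the worker below folds string matching and model prediction into one stub function, and all names are illustrative (a real deployment would load the Bi-GRU model in each worker).

```python
from multiprocessing import Pool

def screen_sentence(sentence):
    """Worker combining parts 2 and 3: exact string matching, then a
    stand-in for the trained relation model."""
    if "disease" in sentence and "marker" in sentence:   # exact string match
        return sentence                                  # model stub: keep it
    return None

def parallel_screen(sentences, processes=20):
    """Screen sentences across worker processes, as MarkerGenie does
    with 20 processes."""
    with Pool(processes) as pool:
        results = pool.map(screen_sentence, sentences)
    return [s for s in results if s is not None]
```

Because the sentences are independent, the work divides cleanly across processes, and the speedup is roughly linear until I/O or model loading dominates.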

S2 Term lists of biomedical entities
The disease list was built from Disease Ontology (Schriml et al., 2012): diseases under the "disease" branch (DOID: 4) of the OBO tree (disease classification based on etiology) were chosen, and leaf nodes were included as part of the synonyms of their parent node to limit the total number of disease types. Meanwhile, we removed generic disease terms such as "primary bacterial infectious disease". Next, the "Name" field of each disease was chosen as its unique name, the "DOID" as its unique ID, and all "UMLS CUI" entries in Xrefs were searched in the UMLS Metathesaurus (Bodenreider, 2004) to find synonyms.
The microbiome list was built from the NCBI microbial taxonomy (Schoch et al., 2020): the taxonomy files "nodes.dmp" and "division.dmp" provided the "division name" and "rank" columns, which were used to filter for Bacteria, Viruses and Phages at or below the family level. The "tax id" column was linked to "names.dmp", from which we obtained each entity's common name and synonyms.
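The filtering step above can be sketched against the public NCBI taxdump format, where fields are delimited by `\t|\t`; the division IDs used below (0 = Bacteria, 3 = Phages, 9 = Viruses) come from division.dmp, and the toy records are illustrative.

```python
def parse_dmp(lines):
    """Parse NCBI taxdump records: fields are delimited by '\t|\t' and
    each line ends with '\t|'."""
    for line in lines:
        yield [f.strip() for f in line.rstrip("\t|\n").split("\t|\t")]

def filter_taxa(node_records, divisions, ranks):
    """Keep tax ids whose division and rank pass the filter, mirroring
    the Bacteria/Viruses/Phages, family-level-or-below selection.
    nodes.dmp fields by position: 0 = tax id, 2 = rank, 4 = division id."""
    return {rec[0] for rec in node_records
            if rec[4] in divisions and rec[2] in ranks}

# Toy nodes.dmp records: E. coli (species, Bacteria) and human (species, Primates).
nodes = ["562\t|\t561\t|\tspecies\t|\tEC\t|\t0\t|\n",
         "9606\t|\t9605\t|\tspecies\t|\tHS\t|\t5\t|\n"]
microbes = filter_taxa(parse_dmp(nodes), divisions={"0", "3", "9"},
                       ranks={"family", "genus", "species"})
```

The kept tax ids are then joined against names.dmp records to collect each entity's common name and synonyms.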
The gene list was curated from HGNC (Braschi et al., 2019) by mapping HGNC gene IDs to UMLS to obtain synonyms.
The above term lists were inconsistently formatted and contained ambiguities and errors. We devised a list of rules to improve their accuracy: 1. Ignore synonyms that are contained in other synonyms.
2. If case is taken into consideration, add the lowercase form of terms beginning with an uppercase letter to the synonyms.
3. Filter out terms shorter than 3 characters.
4. Ignore the taxonomic authority name of a microbe, because this name is redundant.
5. Remove any microbes designated as "type material".
6. If a synonym ends with "gene", add the part without "gene" to the synonyms.
7. Replace "-" with a space, e.g. "pharynx-tumors".
8. Filter out terms containing a semicolon, such as "Carcinoma;bowel;large".
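Several of the cleaning rules above can be sketched as a single pass over one entity's synonym list; this is a simplified illustration (the microbiome-specific rules about authority names and type material are omitted), and the function name is ours, not MarkerGenie's.

```python
def clean_synonyms(synonyms, case_sensitive=True):
    """Apply a subset of the term-cleaning rules to one synonym list."""
    # Ignore synonyms that are contained in other synonyms.
    kept = [t for t in synonyms
            if not any(t != u and t in u for u in synonyms)]
    out = set()
    for term in kept:
        term = term.replace("-", " ")          # replace '-' with a space
        if len(term) < 3 or ";" in term:       # length and semicolon filters
            continue
        out.add(term)
        if case_sensitive and term[:1].isupper():
            out.add(term.lower())              # add lowercase variant
        if term.endswith(" gene"):
            out.add(term[: -len(" gene")])     # add variant without 'gene'
    return out
```

Applying the containment rule before the expansions avoids immediately discarding variants that the later rules add.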
These term lists are updated every six months to include new biomedical entities.

S3 Disease-miRNA association inference
In this section, we demonstrate an application of MarkerGenie's output: disease-miRNA association inference. This builds on the prior knowledge that miRNAs with similar functions are more likely to be associated with similar diseases, and that diseases with high semantic similarity are more likely to be associated with functionally similar miRNAs (Wang et al., 2010; You et al., 2017). Two major databases, HMDD (Li et al., 2014) and dbDEMC (Yang et al., 2017), have been used in prior studies. HMDD (v2.0) contains a total of 5,430 experimentally supported miRNA-disease association entries involving 383 diseases and 495 miRNAs. dbDEMC stores differentially expressed miRNAs in cancers detected by high- or low-throughput methods; it contains 56,647 miRNA-disease associations between 40 diseases and 4,495 miRNAs.
Given these databases of known disease-miRNA associations, a crucial task is to infer new associations (Xuan et al., 2013), and PBMDA (You et al., 2017) is one tool developed for this purpose. In brief, a graph is constructed in which each node represents an miRNA or a disease, and an edge connects a disease and an miRNA whenever an association is present in HMDD. A path-based computational model then predicts missing edges, which represent inferred associations.
Using the same graph-based model, but building edges from MarkerGenie's output instead of the HMDD database, we assigned edge weights as follows. We first obtained all MarkerGenie-identified disease-miRNA instances from the corpus, covering 299 disease terms and 404 miRNAs that overlap with HMDD. A graph was built with each miRNA or disease as a node; the weights between diseases and between miRNAs were set according to the disease semantic similarity score matrix and the miRNA functional similarity score matrix provided by PBMDA (You et al., 2017), respectively. The weight between a disease and an miRNA was initialized as follows: if disease i and miRNA j were predicted to be associated by MarkerGenie, weight(i, j) was set to 1, and otherwise to 0. The same path score function used by PBMDA was applied to update the graph (Figure S2A).
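The initialization of the disease-miRNA block of the weight matrix is simple enough to state as code; the function name and toy entity names below are illustrative.

```python
def init_weights(diseases, mirnas, predicted_pairs):
    """Initialize the disease-miRNA block of the graph's weight matrix:
    weight(i, j) = 1 if MarkerGenie predicted disease i and miRNA j to
    be associated, else 0."""
    pairs = set(predicted_pairs)
    return [[1.0 if (d, m) in pairs else 0.0 for m in mirnas]
            for d in diseases]

W = init_weights(["asthma", "sepsis"], ["miR-21", "miR-155"],
                 [("asthma", "miR-21")])
```

The disease-disease and miRNA-miRNA blocks are filled from PBMDA's precomputed similarity matrices rather than initialized this way.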
The inferred relations from both of the above methods were validated on relations among the 31 diseases and 404 miRNAs that overlap with the dbDEMC database, which is assumed to be independent of HMDD. An inferred relation is counted as a true positive if it is present in dbDEMC, and otherwise as a false positive. We adopted the two evaluation methods used in PBMDA. The first uses normalized discounted cumulative gain (NDCG) and mean average precision (MAP) (Radlinski and Craswell, 2010), considering the top 10, top 20 or all miRNAs; higher scores indicate better inference. The second uses the ROC curve across different edge-weight cutoffs. As shown in Figures S2B and S2C, the evaluation results show that MarkerGenie's performance is comparable to HMDD's, so MarkerGenie's output can serve as a surrogate for the curated database.
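For reference, the two ranking metrics can be computed as below for one disease's ranked miRNA list with binary (in/not in dbDEMC) relevance labels; this uses the common 1/log2(rank + 1) discount for NDCG, which may differ slightly from PBMDA's exact variant, and MAP is the mean of the average precision over all evaluated diseases.

```python
import math

def ndcg_at_k(relevance, k):
    """NDCG@k for binary relevance labels in predicted-rank order."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(relevance[:k]))
    ideal = sorted(relevance, reverse=True)[:k]   # all relevant items first
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def average_precision(relevance):
    """Average precision for one ranked list; MAP averages this over
    all evaluated diseases."""
    hits, score = 0, 0.0
    for rank, r in enumerate(relevance, start=1):
        if r:
            hits += 1
            score += hits / rank                  # precision at each hit
    return score / hits if hits else 0.0
```

Both metrics reach 1.0 exactly when every dbDEMC-supported miRNA is ranked ahead of every unsupported one.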